We derive information bounds for the regression parameters in
Cox models when data are missing at random. These calculations are
of interest for understanding the behavior of efficient estimation in
case-cohort designs, a type of two-phase design often used in cohort
studies. The derivations make use of key lemmas appearing in Robins,
Rotnitzky and Zhao [J. Amer. Statist. Assoc. 89 (1994) 846-866] and
Robins, Hsieh and Newey [J. Roy. Statist. Soc. Ser. B 57 (1995)
409-424], but in a form suited for our purposes here. We begin by
summarizing the results of Robins, Rotnitzky and Zhao in a form that
leads directly to the projection method which will be of use for our
model of interest. We then proceed to derive new information bounds
for the regression parameters of the Cox model with data Missing At
Random (MAR). In the final section we exemplify our calculations
with several models of interest in cohort studies, including an i.i.d.
version of the classical case-cohort design of Prentice [Biometrika 73
(1986) 1-11] and Self and Prentice [Ann. Statist. 16 (1988) 64-81].

B. NAN, M. J. EMOND AND J. A. WELLNER

1. Introduction. Models for missing data have been the subject of intense
research over the past decade. In particular, the landmark paper of
Robins, Rotnitzky and Zhao (1994) (hereafter RRZ) provides theoretical
results for information bounds in semiparametric regression models with some
covariates missing at random. RRZ studied extensively the special case
where the model for the complete data is restricted only by specification
of its mean, conditional on the covariates. They provided a brief treatment
of the case where the full data model is the Cox regression model. In related
work, Robins, Hsieh and Newey (1995) (hereafter RHN) provided information
bounds for classical regression models with missing covariate data.

Meanwhile, case-cohort and stratified case-cohort designs have become 
increasingly important and popular in epidemiology since the basic work of 
Prentice (1986) and Self and Prentice (1988). For reports of studies using 
these designs, see, for example, Bell, Hertz-Picciotto and Beaumont (2001),
Dome, Chung, Bergemann, Umbricht, Saji, Carey, Grundy, Perlman, Breslow and Sukumar
(1999), Margolis, Knauss and Bilker (2002), Mark, Qiao, Dawsey, Wu, Katki, Guntere, Fraumeni, Blot
(2000), Rasmussen, Folsom, Catellier, Tsai, Garg and Eckfeldt (2001), Zeegers, Goldbohm and van der
(2001) and Zeegers, Swaen, Kant, Goldbohm and van den Brandt (2001). These
study designs correspond to missing data models, since complete data are 
collected only on a subsample of the study cohort. The currently used esti- 
mators for the Cox model with these designs are not known to be efficient, 
being based on pseudo-likelihoods or various ad hoc estimating equations. 
Because of the sheer volume of studies using these designs, it is becoming 
increasingly important to better understand the following: 

(1) What are the information bounds for these types of designs and models? 

(2) How much information is being lost by use of ad hoc estimators? 

(3) Is it possible to construct reasonable, easily computable estimators which 
achieve the information bounds? 

Our goal here is to begin to address the first two of these issues. 

We begin by reorganizing and summarizing some results appearing in RRZ 
and RHN. Our summary (in Section 2) is formulated in a way which will 
lead quickly to information bounds for the models of primary concern here, 
namely Cox regression models with missing data. Our new information 
bounds for the Cox regression model with missing data are presented in 
Section 3. The efficient scores are characterized in terms of the solution of 
an integral equation. 

In Section 4 the information bounds are calculated explicitly for particu- 
lar submodels in several special cases, including case-cohort and exposure- 
stratified case-cohort versions of the Cox model. Although it has been known 
for some time that pseudo-likelihood estimators are not semiparametrically 
efficient, our explicit calculations quantify the loss of efficiency, and also show 
that two-phase designs with stratified subsampling can partially recover the 
information that is lost due to missing data. 

Although we will not address question (3) in this paper, we note that for 
complex models such as those under study in Sections 3 and 4 of this paper, 
it is not uncommon for the calculation of information bounds to precede and 
aid in the development of efficient estimators. For example, the information 
bounds obtained by Sasieni (1992a, b) came seven or eight years before the 
development of efficient estimators for "partly linear" extensions of the Cox 
model in Huang (1999). Construction of efficient estimators for case-cohort 
designs, with and without stratification, will be treated by the first author.
For preliminary work in this direction, see Nan (2001). 

While our focus here is on information bounds rather than on construc- 
tion of estimators, we comment briefly here on work on the estimation side 
of the problem. Most of the recent work on estimators for missing data in the 
Cox model focuses on improvements of the pseudo-likelihood estimators of 
Self and Prentice (1988); see, for example, Borgan, Langholz, Samuelsen, Goldstein and Pogoda 
(2000), Chen and Lo (1999) and the methods developed for related missing 
data models in Chatterjee, Chen and Breslow (2003). 

2. Information bounds for models with missing data. We first give a brief 
review of the general setting for information bound calculations with missing 
data. This material is a reworking of important results in Robins, Rotnitzky and Zhao 
(1994) in a form suitable for our present calculations. Readers new to these 
calculations may also be interested in van der Vaart [(1998), pages 379-383] 
and Emond and Wellner (1995). 

The general setup in this article is as follows: we suppose that $U^0$ is a
random vector with distribution $Q$ in the model $\mathcal{Q}$: $U^0$ represents
the "full" or "complete data." The complete or "full data" model $\mathcal{Q}$
may be parametric, semiparametric or nonparametric, but in our examples it will
be semiparametric: $\mathcal{Q} = \{Q_{\theta,\eta} : \theta \in \Theta \subset \mathbb{R}^d,\ \eta \in \mathcal{H}\}$,
where $\theta$ is the parameter of primary interest and $\eta$ is an
infinite-dimensional "nuisance parameter." The "observed data" is $U$, where
typically $U^0 = (U_1^0, U_2^0)$, and then $U = (U_1, R) = (U^0, 1)$ when the
indicator variable $R = 1$, and $U = (U_1, R) = (U_1^0, 0)$ when $R = 0$. The
distribution of $U$ is $P$, an element of the (induced) "observed data" model
$\mathcal{P}$. In our examples $\mathcal{P}$ is semiparametric, parametrized by
$(\theta, \eta)$, where $\theta \in \Theta \subset \mathbb{R}^d$ is the parameter
of interest and $\eta$ is a nuisance parameter. The goal is to find the
information bound for estimation of $\theta$ when $\eta$ is unknown, based on
observation of $U_1, \ldots, U_n$ i.i.d. as $U \sim P_{\theta,\eta} \in \mathcal{P}$.

Here is the primary model of interest for which the information bound is 
derived in Sections 3-5. 

Example (The Cox model with missing covariates). Let $T$ be a failure
time, $C$ be a censoring time and $Z = (X, V) \in \mathbb{R}^d$ be a covariate
vector which is not time dependent. The data $X$ are missing at random, while
$Y = T \wedge C$, $\Delta = 1_{[T \le C]}$ and $V$ and $R$ are always observed,
where $R$ is an indicator of missingness as above. The full data are
$U^0 = (Y, \Delta, X, V)$ and the observed data are $U = (Y, \Delta, RX, V, R)$
in the general notation introduced above. Note that $X$ may be missing by
design, as in two-phase studies. In a two-phase study $(Y, \Delta, V)$ is
observed for all subjects in phase 1 of the study. [In some "classical"
case-cohort designs, only $(\Delta, V)$ is observed in phase 1. We will treat
this case briefly in Section 3.3.] In phase 2 $X$ is obtained on a subsample
of the subjects. The probability of being included in this subsample may
depend on what was observed in phase 1. We are interested in estimating
the effect on $T$ of the covariate $Z = (X, V)$. Let $(T|Z) \sim F(\cdot|Z)$
with density $f = f_{\theta,\lambda}$, where

$$1 - F_{\theta,\lambda}(t|z) = \exp(-e^{\theta' z}\Lambda(t)) \quad\text{so}\quad \frac{f_{\theta,\lambda}(t|z)}{1 - F_{\theta,\lambda}(t|z)} = e^{\theta' z}\lambda(t),$$

where $\lambda$ is the (Lebesgue) density of $\Lambda$. Also, let
$(C|Z) \sim G(\cdot|Z)$ with density $g$, where

$$\frac{g(c|z)}{1 - G(c|z)} = \lambda_G(c|z) \quad\text{and}\quad \Lambda_G(c|z) = \int_0^c \lambda_G(t|z)\,dt.$$

We assume that $T$ and $C$ are conditionally independent given $Z$
(noninformative censoring). Let $Z \sim H$ with density $h$. Then $\mathcal{Q}$
is the set of all densities of the form

$$\begin{aligned}
q_{\theta,\eta}(y,\delta,x,v) &= q(y,\delta,x,v) \\
&= \Big\{f(y|z)\int_{(y,\infty)} g(t|z)\,dt\Big\}^{\delta}\Big\{g(y|z)\int_{(y,\infty)} f(t|z)\,dt\Big\}^{1-\delta} h(z) \\
&= \{e^{\theta' z}\lambda(y)\}^{\delta}\exp(-e^{\theta' z}\Lambda(y))\,\{\lambda_G(y|z)\}^{1-\delta}\exp(-\Lambda_G(y|z))\,h(z).
\end{aligned} \tag{2.1}$$

The regression parameter $\theta$ is the parameter of interest, and the nuisance
parameter $\eta$ is $(\lambda, \lambda_G, h)$. This is basically the model
introduced by Cox (1972). The model $\mathcal{P}$ for the observed data is the
set of all distributions with densities of the form

$$p(r, y, \delta, (r\cdot x), v) = \big(\pi(y,\delta,v)\,q(y,\delta,x,v)\big)^{r} \times \Big((1 - \pi(y,\delta,v))\int q(y,\delta,x,v)\,d\mu(x)\Big)^{1-r}, \tag{2.2}$$

where $\pi(Y, \Delta, V) = P(R = 1|U_1^0)$, with $\pi(Y, \Delta, V) \ge \sigma > 0$,
and $\mu$ is a dominating measure on $\mathcal{X}$.

Now we give a brief introduction to efficiency calculations in general; for 
more details see Bickel, Klaassen, Ritov and Wellner (1993) or van der Vaart 
[(1998), pages 362-371]. 

2.1. Introduction to information bounds for semiparametric models. The
information bound for estimation of $\theta$ in the model $\mathcal{P}$ is equal
to $I_\theta^{*-1}$. Here, $I_\theta^*$ is the efficient information matrix for
$\theta$ in $\mathcal{P}$, given by

$$I_\theta^* = E_P(l_\theta^* l_\theta^{*T}) \equiv E_P(l_\theta^{*\otimes 2}), \tag{2.3}$$

where $l_\theta^*$ is the efficient score for $\theta$. The efficient score for
$\theta$ in a model $\mathcal{P} = \{P_{\theta,\eta} : \theta \in \Theta,\ \eta \in \mathcal{H}\}$
with nuisance parameters $\eta$ is the (componentwise) projection of the vector
of scores $\dot l_\theta \in (L_2^0(P))^d$ for $\theta$ onto the orthogonal
complement of the (closure of the linear span) of all scores for the nuisance
parameters, $\dot{\mathcal{P}}_\eta^\perp$. Intuitively, when $\eta$ is unknown,
information about $\theta$ can only come from that component of $\dot l_\theta$
that is statistically independent of variability in the data controlled by the
nuisance parameter. This component is $l_\theta^*$. Formally, each component of
the efficient score for $\theta$ is orthogonal to all scores for nuisance
parameters, where orthogonality is relative to the inner product
$\langle b(U), a(U)\rangle_{L_2^0(P)} = E_P\{ba\}$ in the space $L_2^0(P)$, the
space of all mean-zero square-integrable functions of $U$. Let
$\dot{\mathcal{P}}$ be the tangent space for $\mathcal{P}$. $\dot{\mathcal{P}}$
is the closure of the linear span of the scores of all submodels of
$\mathcal{P}$ passing through $P$ (see BKRW). The tangent space for a model can
be thought of as the space of all components of its score, and the "size" of
$\dot{\mathcal{P}}$ corresponds to the amount of unknown information about
$\mathcal{P}$. For example, when $\mathcal{P}$ is completely nonparametric,
$\dot{\mathcal{P}}$ is all of $L_2^0(P)$. Now let $\dot{\mathcal{P}}_\theta$ and
$\dot{\mathcal{P}}_\eta$ be the tangent spaces for the submodels of
$\mathcal{P}$ where $\eta$ and $\theta$ are assumed known, respectively.
Analogous definitions hold for $\dot{\mathcal{Q}}_\theta$ and
$\dot{\mathcal{Q}}_\eta$. Then
$\dot{\mathcal{P}}_\theta + \dot{\mathcal{P}}_\eta \subset \dot{\mathcal{P}}$ and
we may assume for our purposes that
$\dot{\mathcal{P}}_\theta + \dot{\mathcal{P}}_\eta = \dot{\mathcal{P}}$; see BKRW
(page 76) for a discussion. The orthogonality condition described above for
$l_\theta^* \in (\dot{\mathcal{P}}_\eta^\perp)^d$ is

$$l_\theta^* \perp \dot{\mathcal{P}}_\eta \quad\text{in } L_2^0(P); \tag{2.4}$$

that is, $E_P(l_\theta^* b) = 0$ for all $b \in \dot{\mathcal{P}}_\eta$, where
the orthogonality is componentwise (i.e., it holds for each component of the
vector of functions $l_\theta^*$). Thus, our approach to calculating
$l_\theta^*$ and the resulting efficiency bound is to project $\dot l_\theta$
onto the orthocomplement $\dot{\mathcal{P}}_\eta^\perp$ of
$\dot{\mathcal{P}}_\eta$ in $L_2^0(P)$:

$$l_\theta^* = \Pi(\dot l_\theta \,|\, \dot{\mathcal{P}}_\eta^\perp), \tag{2.5}$$

where $\Pi$ denotes the projection operator; see, for example, Bickel, Klaassen,
Ritov and Wellner [(1993), Appendix A.2]. [Here the $\perp$ (orthogonal
complement) denotes the ortho-complement in $L_2^0(P)$ or $L_2^0(Q)$, depending
on the context.] The space $\dot{\mathcal{P}}_\eta^\perp$ is of further
importance, because it contains all influence functions for all regular
estimators of $\theta$ in $\mathcal{P}$. Note that
$l_\theta^* = \Pi(\dot l_\theta\,|\,\dot{\mathcal{P}}_\eta^\perp) = b^*$ if and only if

$$\langle b, \dot l_\theta - b^*\rangle = 0 \quad\text{for all } b \in \dot{\mathcal{P}}_\eta^\perp. \tag{2.6}$$

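This implies that we can find the desired projection by proposing a guess $b^*$
for $l_\theta^*$ and then showing that (2.6) holds. This requires some
understanding of $\dot{\mathcal{P}}_\eta^\perp$. However, this last requirement
can be relaxed somewhat. Since
$\dot l_\theta \in \dot{\mathcal{P}} = \dot{\mathcal{P}}_\theta + \dot{\mathcal{P}}_\eta$, we have

$$\Pi(\dot l_\theta\,|\,\dot{\mathcal{P}}_\eta^\perp) = \Pi(\dot l_\theta\,|\,\dot{\mathcal{P}} \cap \dot{\mathcal{P}}_\eta^\perp) = \Pi(\dot l_\theta\,|\,\mathcal{M} \cap \dot{\mathcal{P}}_\eta^\perp)$$

for any (closed) subspace $\mathcal{M}$ such that
$\dot{\mathcal{P}} \subset \mathcal{M} \subset L_2^0(P)$. So it is sufficient
to be able to identify all $b$ in some set
$\mathcal{M} \cap \dot{\mathcal{P}}_\eta^\perp$, which might not be all of
$\dot{\mathcal{P}}_\eta^\perp$. Note that
$\dot{\mathcal{P}} \cap \dot{\mathcal{P}}_\eta^\perp \subset \mathcal{M} \cap \dot{\mathcal{P}}_\eta^\perp \subset \dot{\mathcal{P}}_\eta^\perp$.
This is essentially the approach of Robins, Rotnitzky and Zhao (1994); see also
the discussion in van der Vaart [(1998), pages 379-383]. The approach proceeds
by identifying an "intermediate" set
$\mathcal{K} = \mathcal{M} \cap \dot{\mathcal{P}}_\eta^\perp$, and then knowing
$l_\theta^* \in (\mathcal{K})^d$ provides a general form for $l_\theta^*$. An
expression for $l_\theta^*$ will be obtained by finding the specific element of
$\mathcal{K}$ that is the projection of $\dot l_\theta$ onto $\mathcal{K}$. The
next subsection provides the formal results necessary to carry out this approach.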
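
To make the projection in (2.5) concrete, consider the simplest situation, in which the nuisance parameter is finite dimensional. This is a standard textbook special case and not one of the models treated in this paper. The symbols $\dot l_\eta$, $I_{\theta\theta}$, $I_{\theta\eta}$ and $I_{\eta\eta}$ are introduced only for this sketch, and $I_{\eta\eta}$ is assumed to be invertible.

```latex
% Assumption: \eta \in \mathbb{R}^m with score vector \dot l_\eta, so that
% \dot{\mathcal P}_\eta is the linear span of the components of \dot l_\eta.
\begin{align*}
I_{\theta\theta} &= E_P(\dot l_\theta \dot l_\theta^T), \quad
I_{\theta\eta} = E_P(\dot l_\theta \dot l_\eta^T), \quad
I_{\eta\eta} = E_P(\dot l_\eta \dot l_\eta^T), \\
l_\theta^* &= \Pi(\dot l_\theta \mid \dot{\mathcal P}_\eta^\perp)
  = \dot l_\theta - I_{\theta\eta} I_{\eta\eta}^{-1} \dot l_\eta, \\
E_P(l_\theta^* \dot l_\eta^T)
  &= I_{\theta\eta} - I_{\theta\eta} I_{\eta\eta}^{-1} I_{\eta\eta} = 0
  \quad\text{(the orthogonality condition (2.4))}, \\
I_\theta^* &= E_P(l_\theta^* l_\theta^{*T})
  = I_{\theta\theta} - I_{\theta\eta} I_{\eta\eta}^{-1} I_{\eta\theta}.
\end{align*}
```

In the semiparametric models considered in this paper the nuisance tangent space is infinite dimensional, so the matrix inverse is replaced by the solution of an operator equation.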

2.2. Information bounds with missing data. The following material is
essentially a special case of results appearing in Robins, Rotnitzky and Zhao
(1994). Throughout this subsection we take the complete data to be
$U^0 = (U_1^0, U_2^0) \sim Q \in \mathcal{Q}$, with $U_2^0$ Missing At Random
(MAR) in the observed data: thus $R$ is an indicator of whether $U_2^0$ is
observed, with $P(R = 1|U^0) = P(R = 1|U_1^0) = \pi(U_1^0)$ and
$\pi(U_1^0) \ge \sigma > 0$. If $\pi = \pi_\gamma$ is allowed to be unknown via
parameters $\gamma \in \Gamma$, then we let
$\mathcal{R} = \{\pi_\gamma : \gamma \in \Gamma\}$ denote the model for the
"missingness probabilities" $\pi$. The observed data are
$U = (U_1^0, U_2, R) \equiv (U_1, R) \equiv (U_1^0, RU_2^0, R)$. Because there
is a measurable map from $(U^0, R)$ to $U$, the tangent space for
$\mathcal{Q} \times \mathcal{R}$ can be mapped to $\dot{\mathcal{P}}$ via an
operator $A$ that we call the score operator. The tangent space
$\dot{\mathcal{Q}} \times \dot{\mathcal{R}}$ for $\mathcal{Q} \times \mathcal{R}$
is just $\dot{\mathcal{Q}}$ under our assumption that $\pi(U_1^0)$ is known, and
we indeed impose this assumption throughout the rest of this paper.

Remark. We have not included $R$ in $U^0$ because the functions in
$\dot{\mathcal{Q}}$ do not depend on $R$. However, if $\pi$ were partially
unknown, then the parameters of $\pi$ would be additional nuisance parameters,
$R$ would be included in $U^0$, and $\dot{\mathcal{Q}}$ would be replaced by
$\dot{\mathcal{Q}} + \dot{\mathcal{R}}$ in Lemmas 2.1 and 2.2.

For $a \in L_2^0(Q)$ define the (score) operator $A : L_2^0(Q) \to L_2^0(P)$ by

$$Aa(U) = (Aa)(U) = E\{a(U^0)|U\} = Ra(U^0) + (1 - R)E\{a(U^0)|U_1^0\}.$$

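Lemma 2.1.

A. $\{Aa(U) : a \in \dot{\mathcal{Q}}\} \subset \dot{\mathcal{P}}$.

B. The adjoint $A^T : L_2^0(P) \to L_2^0(Q)$ of $A$ is given by
$A^T b(U^0) = E\{b(U)|U^0\}$ for $b \in L_2^0(P)$.

C. $A^T A a = \pi(U_1^0)a(U^0) + (1 - \pi(U_1^0))E[a|U_1^0]$ for $a \in L_2^0(Q)$.

D. The operator $(A^T A)^{-1}$ is given by

$$(A^T A)^{-1} a = \frac{a}{\pi(U_1^0)} - \frac{1 - \pi(U_1^0)}{\pi(U_1^0)}E[a|U_1^0].$$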
Robins, Rotnitzky and Zhao (1994) denote the operator $A^T A$ by $m$. Then
C and D are special cases of their Propositions 8.2a2 and 8.2e. The next
lemma is key since it identifies $\dot{\mathcal{P}}_\eta^\perp$ once $\dot{\mathcal{Q}}_\eta^\perp$ is known.
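
Parts C and D of Lemma 2.1 can be checked by hand. The short calculation below is only a sketch: it uses the formula for $A$ and the adjoint in part B, and the abbreviations $\pi = \pi(U_1^0)$ and $\bar a = E[a|U_1^0]$ are introduced here for convenience.

```latex
% Notation for this sketch only: \pi = \pi(U_1^0), \bar a = E[a | U_1^0].
\begin{align*}
A^T A a &= E[\,R a + (1-R)\bar a \mid U^0\,] = \pi a + (1-\pi)\bar a =: c
  && \text{(part C)}, \\
E[c \mid U_1^0] &= \pi \bar a + (1-\pi)\bar a = \bar a
  && \text{($\pi$ depends on $U_1^0$ only)}, \\
\frac{c}{\pi} - \frac{1-\pi}{\pi}\, E[c \mid U_1^0]
  &= a + \frac{1-\pi}{\pi}\,\bar a - \frac{1-\pi}{\pi}\,\bar a = a
  && \text{(part D inverts part C)}.
\end{align*}
```

The division by $\pi$ is where the condition $\pi(U_1^0) \ge \sigma > 0$ enters: it keeps the inverse bounded.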

Lemma 2.2. Suppose that $\pi(U_1^0) \ge \sigma > 0$. Then:

A. Range($A$) is closed [and so is Range($A|_{\dot{\mathcal{Q}}}$) or the range
of $A$ restricted to any other closed subspace of $L_2^0(Q)$].

B. Let $b \in L_2^0(P)$. Then $b \in \dot{\mathcal{P}}_\eta^\perp$ if and only
if $A^T b \in \dot{\mathcal{Q}}_\eta^\perp$.

Part B of Lemma 2.2 is Lemma A.6 of Robins, Rotnitzky and Zhao (1994),
while part A of Lemma 2.2 is proved in Robins and Rotnitzky [(1992),
proof of Theorem 4.1, page 326]. It is also in the proof of Lemma A.4 in
Robins, Rotnitzky and Zhao [(1994), page 862]. Note that the condition of
Lemma 2.2 is used by Robins, Rotnitzky and Zhao (1994) in the proofs of
both parts A and B since RRZ's proof of Lemma A.6 uses their Lemma A.2
which in turn has their (36) as a hypothesis.

Now suppose that $\dot{\mathcal{Q}}_\eta^\perp$ is known. Then for the subspace
$\mathcal{M}$ in the discussion earlier in this section we will take
$\dot{\mathcal{P}} + \mathcal{K}$, where $\mathcal{K}$ consists of the closed
subspace of all functions $k(U)$ of the form

$$k(U) = \frac{R}{\pi}\zeta(U^0) - \frac{R - \pi}{\pi}E[\zeta(U^0)|U_1^0],$$

where $\zeta(U^0) \in \dot{\mathcal{Q}}_\eta^\perp$. Moreover, let $\mathcal{J}$
be the set of all $j(U)$ such that

$$j(U) = \frac{R}{\pi}\zeta(U^0) - \frac{R - \pi}{\pi}\phi(U_1^0),$$

where $\zeta(U^0) \in \dot{\mathcal{Q}}_\eta^\perp$ and $\phi(U_1^0)$ is any
function in $L_2(P_{U_1^0})$. The set $\mathcal{J}$ is discussed by Robins,
Rotnitzky and Zhao (1994) and by van der Vaart [(1998), page 383]. As noted by
RRZ, the particular function $\phi(U_1^0)$ in the definition of $k$ yields the
smallest variance for a given function $\zeta$.

The next three lemmas and two propositions characterize $\mathcal{J}$ and
$\mathcal{K}$ in terms of $\dot{\mathcal{P}}$ and $\dot{\mathcal{P}}_\eta^\perp$
and show that $\mathcal{K}$ has the desired properties. These propositions form
the basis for our specific information bound calculations for the Cox model in
the sections to follow.

The next lemma shows that every $b = Aa \in L_2^0(P)$ for $a \in L_2^0(Q)$ can
be decomposed into the form
$(R/\pi)(A^T A)a - \Pi((R/\pi)(A^T A)a\,|\,\mathcal{J}^{(2)})$, where
$\mathcal{J}^{(2)}$ is the subspace of $L_2^0(P)$ with the form of the second
term in the definition of the class $\mathcal{J}$. The following Lemma 2.3 is a
special case of Lemma A.3 of Robins, Rotnitzky and Zhao (1994).
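Lemma 2.3. Suppose $a(U^0) \in L_2^0(Q)$. Then
$Aa \perp \frac{R - \pi}{\pi}\phi(U_1^0)$ for all $a \in L_2^0(Q)$,
$\phi \in L_2(Q)$; equivalently

$$Aa = \frac{R}{\pi}(A^T A)(a) - \Pi\Big(\frac{R}{\pi}(A^T A)(a)\,\Big|\,\mathcal{J}^{(2)}\Big), \tag{2.7}$$

where $\mathcal{J}^{(2)} = \{(R/\pi - 1)\phi(U_1^0) : \phi(U_1^0) \in L_2(Q)\}$.

Proposition 2.1. $\mathcal{K} \subset \mathcal{J} \subset \dot{\mathcal{P}}_\eta^\perp$ and $\dot{\mathcal{P}}_\eta^\perp \cap \dot{\mathcal{P}} \subset \mathcal{K}$.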

Remark. Any function $b \in (\dot{\mathcal{P}}_\eta^\perp)^d$ with
$\langle b, \dot l_\theta^T\rangle = J$, the $d \times d$ identity matrix, is an
influence function for estimation of $\theta \in \Theta \subset \mathbb{R}^d$ in
the model $\mathcal{P}$ for the observed data (see, e.g., BKRW, page 73,
Proposition 3.4.2), and the decomposition of $b = Aa$ given in (2.7) shows how
$b = Aa \in (\dot{\mathcal{P}}_\eta^\perp)^d$ is related to the influence
function $A^T A a$ for estimation of $\theta$ in the model $\mathcal{Q}$ for
the complete data. Note that $(R/\pi)(A^T A)a$ is then the influence function
of an inverse-weighted estimator of $\theta$ in $\mathcal{P}$ of the basic type
proposed by Horvitz and Thompson (1952).
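
The decomposition (2.7) can be verified directly. The sketch below assumes $\pi = \pi(U_1^0)$ is known and bounded away from zero, treats vector-valued functions componentwise, and writes $\bar a = E[a|U_1^0]$ and $c = A^T A a$; these abbreviations are not part of the original notation.

```latex
% Step 1: the projection of (R/\pi)c onto J^{(2)} is ((R-\pi)/\pi) E[c | U_1^0].
% For any \phi^*, \phi (functions of U_1^0), using E[R(R-\pi)|U^0] = E[(R-\pi)^2|U^0] = \pi(1-\pi):
\begin{align*}
E\Big[\Big(\frac{R}{\pi}c - \frac{R-\pi}{\pi}\phi^*\Big)\frac{R-\pi}{\pi}\phi\Big]
  &= E\Big[\frac{1-\pi}{\pi}\,(c - \phi^*)\,\phi\Big]
   = E\Big[\frac{1-\pi}{\pi}\,(E[c \mid U_1^0] - \phi^*)\,\phi\Big],
\end{align*}
% which vanishes for every \phi exactly when \phi^* = E[c | U_1^0].
% Step 2: with c = \pi a + (1-\pi)\bar a we have E[c | U_1^0] = \bar a, so
\begin{align*}
\frac{R}{\pi}c - \frac{R-\pi}{\pi}E[c \mid U_1^0]
  &= R a + \frac{R(1-\pi)}{\pi}\,\bar a - \frac{R-\pi}{\pi}\,\bar a \\
  &= R a + \frac{\pi - R\pi}{\pi}\,\bar a
   = R a + (1-R)\,\bar a = Aa .
\end{align*}
```

Read this way, $Aa$ is the inverse-probability-weighted term $(R/\pi)(A^T A)a$ with its component along the "augmentation" space $\mathcal{J}^{(2)}$ removed.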

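Lemma 2.4. Given a subspace $\mathcal{M}$ of $L_2^0(P)$ there is a unique
projection map $\Pi(\cdot|\mathcal{M})$ from $L_2^0(P)$ onto $\mathcal{M}$. In
particular:

A. For $a^* \in \dot{\mathcal{P}}_\eta$, $h \in L_2^0(P)$ we have
$a^* = \Pi(h|\dot{\mathcal{P}}_\eta)$ if and only if
$\langle h - a^*, a\rangle = 0$ for all $a \in \dot{\mathcal{P}}_\eta$.

B. For $b^* \in \dot{\mathcal{P}}_\eta^\perp$, $h \in L_2^0(P)$ we have
$b^* = \Pi(h|\dot{\mathcal{P}}_\eta^\perp)$ if and only if
$\langle b, h - b^*\rangle = 0$ for all $b \in \dot{\mathcal{P}}_\eta^\perp$.

C. For $b^* \in \dot{\mathcal{P}}_\eta^\perp \cap \dot{\mathcal{P}}$,
$h \in \dot{\mathcal{P}}$, we have
$b^* = \Pi(h|\dot{\mathcal{P}}_\eta^\perp \cap \dot{\mathcal{P}})$ if and only if
$\langle b, h - b^*\rangle = 0$ for all
$b \in \dot{\mathcal{P}}_\eta^\perp \cap \dot{\mathcal{P}}$.

D. Suppose that $h \in \mathcal{M}$ with
$\dot{\mathcal{P}} \subset \mathcal{M} \subset L_2^0(P)$, $\mathcal{M}$ a closed
subspace. Then for $b^* \in \dot{\mathcal{P}}_\eta^\perp \cap \mathcal{M}$,
$b^* = \Pi(h|\dot{\mathcal{P}}_\eta^\perp \cap \mathcal{M})$ if and only if
$\langle b, h - b^*\rangle = 0$ for all
$b \in \dot{\mathcal{P}}_\eta^\perp \cap \mathcal{M}$.

E. For $h \in \dot{\mathcal{P}}$, the projection
$\Pi(h|\dot{\mathcal{P}}_\eta^\perp \cap \mathcal{M}) = \Pi(h|\dot{\mathcal{P}}_\eta^\perp \cap \dot{\mathcal{P}}) \in \dot{\mathcal{P}}$.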

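The following proposition is an immediate consequence of Proposition 2.1
and Lemma 2.4, part E. It is a special case of Proposition 8.1e1 of Robins,
Rotnitzky and Zhao (1994).

Proposition 2.2.
$l_\theta^* = \Pi[\dot l_\theta|\mathcal{K}] = \frac{R}{\pi}\zeta^* - \frac{R - \pi}{\pi}E[\zeta^*|U_1^0]$,
where $l_\theta^*$ is the efficient score for $\theta$ in model $\mathcal{P}$
and $\zeta^*$ is the unique element of $(\dot{\mathcal{Q}}_\eta^\perp)^d$
satisfying

$$\Pi\Big(\frac{1}{\pi}\zeta^* - \frac{1 - \pi}{\pi}E[\zeta^*|U_1^0]\,\Big|\,\dot{\mathcal{Q}}_\eta^\perp\Big) = l_\theta^{*0}, \tag{2.8}$$

where $l_\theta^{*0}$ is the efficient score for $\theta$ in the complete data
model $\mathcal{Q}$.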
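
The characterization (2.8) can be motivated by a short calculation. The following is a sketch rather than the argument of RRZ. It assumes that the observed-data score is $\dot l_\theta = A\dot l_\theta^0$, with $\dot l_\theta^0$ the full-data score, that $\pi = \pi(U_1^0)$ is known, and that all functions are treated componentwise. The symbols $k$, $k^*$, $\bar\zeta = E[\zeta|U_1^0]$ and $\bar\zeta^* = E[\zeta^*|U_1^0]$ are notation for this sketch only.

```latex
% k = (R/\pi)\zeta - ((R-\pi)/\pi)\bar\zeta is a generic element of K, with \zeta in \dot Q_\eta^\perp;
% k^* is the same expression with \zeta^* in place of \zeta.
\begin{align*}
\langle \dot l_\theta, k\rangle_P
  &= \langle \dot l_\theta^0, A^T k\rangle_Q
   = \langle \dot l_\theta^0, \zeta\rangle_Q
   = \langle l_\theta^{*0}, \zeta\rangle_Q
   && (A^T k = E[k \mid U^0] = \zeta), \\
E[k^* k \mid U^0]
  &= \frac{\zeta^*\zeta}{\pi}
   - \frac{1-\pi}{\pi}\,(\zeta^*\bar\zeta + \bar\zeta^*\zeta)
   + \frac{1-\pi}{\pi}\,\bar\zeta^*\bar\zeta , \\
\langle k^*, k\rangle_P
  &= E\Big[\zeta\Big(\frac{\zeta^*}{\pi} - \frac{1-\pi}{\pi}\,\bar\zeta^*\Big)\Big]
   = \Big\langle \frac{\zeta^*}{\pi} - \frac{1-\pi}{\pi}\,E[\zeta^* \mid U_1^0],\ \zeta\Big\rangle_Q .
\end{align*}
% Hence <\dot l_\theta - k^*, k>_P = 0 for every k in K is equivalent to
% l_\theta^{*0} - ( \zeta^*/\pi - ((1-\pi)/\pi) E[\zeta^* | U_1^0] ) \perp \dot Q_\eta^\perp,
% which is the projection statement (2.8).
```

As a sanity check, when $\pi \equiv 1$ (no missing data) the condition reduces to $\Pi(\zeta^*|\dot{\mathcal{Q}}_\eta^\perp) = \zeta^* = l_\theta^{*0}$, so the observed-data efficient score coincides with the full-data one.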
3. The Cox model with missing covariate data. In this section we discuss
the efficient score and information calculations for Cox regression models
with missing covariates as introduced in the example in Section 2. This
model is very useful in epidemiology studies, especially in two-stage designs
(also called two-phase designs) where the probabilities $\pi$ are determined
by the investigator. Equation (2.1) gives the joint density of the complete
data $(Y, \Delta, X, V) = (Y, \Delta, Z)$ and (2.2) gives the joint density of
the observed data $(Y, \Delta, RX, V, R)$. The finite-dimensional parameter
$\theta$ is the parameter of interest and the nuisance parameter
$\eta = (\lambda, \lambda_G, h)$ is a vector of three infinite-dimensional
nuisance parameters. We will use Proposition 2.2 to obtain the efficient score
$l_\theta^*$ in the observed model $\mathcal{P}$ for Cox regression. The
efficient score function in the full model $\mathcal{Q}$, $l_\theta^{*0}$, has
been studied by many authors, such as Efron (1977), Andersen and Gill (1982),
Begun, Hall, Huang and Wellner (1983) and Bickel, Klaassen, Ritov and Wellner
(1993). Hence our main job will be to characterize the space
$\dot{\mathcal{Q}}_\eta^\perp$.

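3.1. The nuisance parameter tangent space $\dot{\mathcal{Q}}_\eta$ of the full
data model $\mathcal{Q}$. The scores for the parameter of interest $\theta$ and
the score operators for the nuisance parameters $\lambda$, $\lambda_G$ and $h$
in the "full" model $\mathcal{Q}$ given in (2.1) are the following:

$$\dot l_1^0(Y,\Delta,Z) \equiv \dot l_\theta^0(Y,\Delta,Z) = \Delta Z - Z\Lambda(Y)e^{\theta' Z} = \int Z\,dM(t), \tag{3.1}$$

$$\dot l_2^0 a(Y,\Delta,Z) \equiv \dot l_\lambda^0 a(Y,\Delta,Z) = \Delta a(Y) - e^{\theta' Z}\int_0^Y a(t)\,d\Lambda(t) = \int a(t)\,dM(t), \tag{3.2}$$

$$\dot l_3^0 b(Y,\Delta,Z) \equiv \dot l_{\lambda_G}^0 b(Y,\Delta,Z) = (1-\Delta)b(Y,Z) - \int_0^Y b(t,Z)\,d\Lambda_G(t|Z) = \int b(t,Z)\,dM_G(t), \tag{3.3}$$

$$\dot l_4^0 c(Y,\Delta,Z) \equiv \dot l_h^0 c(Y,\Delta,Z) = c(Z), \tag{3.4}$$

where $M$ and $M_G$ are martingales, conditional on $Z$, for the failure and
censoring counting processes, respectively;

$$\begin{aligned}
M(t) &= \Delta 1(Y \le t) - \int_0^t 1(Y \ge s)\,d\Lambda(s|Z), \\
M_G(t) &= (1-\Delta)1(Y \le t) - \int_0^t 1(Y \ge s)\,d\Lambda_G(s|Z),
\end{aligned} \tag{3.5}$$

and

$$a(t) = \frac{\partial}{\partial\gamma}\log\lambda_\gamma(t), \qquad b(t,Z) = \frac{\partial}{\partial\nu}\log\lambda_{G\nu}(t|Z), \qquad c(Z) = \frac{\partial}{\partial\kappa}\log h_\kappa(Z),$$

for regular parametric submodels $\{\lambda_\gamma\}$, $\{\lambda_{G\nu}\}$ and
$\{h_\kappa\}$ passing through the true parameters $\lambda$, $\lambda_G$ and
$h$ when $\gamma = 0$, $\nu = 0$ and $\kappa = 0$, respectively. Here we abuse
notation slightly by writing $\Lambda(\cdot)$ for the baseline cumulative hazard
function and $\Lambda(\cdot|Z)$ for the cumulative hazard function conditional
on $Z$. Under the proportional hazards assumption
$\Lambda(\cdot|Z) = \Lambda(\cdot)e^{\theta' Z}$.
Then the scores in the observed model (2.2) are the following:

$$\dot l_i(Y,\Delta,RX,V,R) = R\,\dot l_i^0(Y,\Delta,Z) + (1-R)E\{\dot l_i^0(Y,\Delta,Z)|Y,\Delta,V\}, \qquad i = 1,\ldots,4.$$

Hence the tangent spaces for the parameters in the two models are

$$\dot{\mathcal{Q}}_i = [\dot l_i^0], \qquad \dot{\mathcal{P}}_i = [\dot l_i], \qquad i = 1,\ldots,4. \tag{3.6}$$

Here $[a]$ denotes the closed linear span of the set $a$ in $L_2^0(Q)$ or
$L_2^0(P)$ as in Bickel, Klaassen, Ritov and Wellner [(1993), page 49]. By
definition all the elements in $\dot{\mathcal{Q}}_i$ and $\dot{\mathcal{P}}_i$,
$i = 1,\ldots,4$, are square integrable. It is easy to see that
$\dot{\mathcal{Q}}_2$, $\dot{\mathcal{Q}}_3$ and $\dot{\mathcal{Q}}_4$ are
mutually orthogonal. Since they are closed (by definition), the nuisance tangent
space is $\dot{\mathcal{Q}}_\eta = \dot{\mathcal{Q}}_2 + \dot{\mathcal{Q}}_3 + \dot{\mathcal{Q}}_4$.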
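
The score (3.1) and its martingale representation can be recovered from the density (2.1) in two lines. This is a sketch under the proportional hazards form $\Lambda(t|z) = e^{\theta' z}\Lambda(t)$ and with the martingale $M$ as in (3.5); $M(\infty)$ denotes the value of $M(t)$ for $t$ beyond $Y$ and is notation used only here.

```latex
% Log density from (2.1); only the first two terms involve \theta.
\begin{align*}
\log q(y,\delta,x,v)
  &= \delta\{\theta' z + \log\lambda(y)\} - e^{\theta' z}\Lambda(y)
     + (1-\delta)\log\lambda_G(y|z) - \Lambda_G(y|z) + \log h(z), \\
\frac{\partial}{\partial\theta}\log q(y,\delta,x,v)
  &= \delta z - z\,e^{\theta' z}\Lambda(y), \\
\int Z\,dM(t) = Z\,M(\infty)
  &= Z\Big\{\Delta - \int_0^\infty 1(Y \ge s)\,e^{\theta' Z}\,d\Lambda(s)\Big\}
   = \Delta Z - Z\,\Lambda(Y)\,e^{\theta' Z}.
\end{align*}
```

The same calculation with a one-dimensional submodel $\lambda_\gamma$ in place of $\theta$ gives (3.2), and the analogous steps for $\lambda_G$ and $h$ give (3.3) and (3.4).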
Let $W_1$ and $W_2$ be subdistributions on $\mathbb{R}^{d+1}$ defined by

$$W_1(y,z) = Q(Y \le y, Z \le z, \Delta = 1) = \int_{(-\infty,z]}\int_{(0,y]} (1 - G(t|z'))\,dF(t|z')\,dH(z'),$$

$$W_2(y,z) = Q(Y \le y, Z \le z, \Delta = 0) = \int_{(-\infty,z]}\int_{(0,y]} (1 - F(t|z'))\,dG(t|z')\,dH(z'),$$

corresponding to the uncensored and censored data. Hence $W = W_1 + W_2$
is the marginal distribution of $(Y,Z)$. Then we can define $L_2$ spaces
corresponding to the subprobability measures

$$L_2(W_i) = \Big\{u(Y,Z) : \int\!\!\int u^2(y,z)\,dW_i(y,z) < \infty\Big\}, \qquad i = 1,2.$$

These spaces will be used in characterizing $\dot{\mathcal{Q}}_2$,
$\dot{\mathcal{Q}}_3$ and thus $\dot{\mathcal{Q}}_\eta$. It is easy to see that
$L_2(W) = L_2(Q_{Y,Z}) = L_2(W_1) \cap L_2(W_2)$, $L_2(Q_{T,Z}) \subset L_2(W_1)$
and $L_2(Q_{C,Z}) \subset L_2(W_2)$. Here $Q_{Y,Z}$, $Q_{T,Z}$ and $Q_{C,Z}$
denote the marginal distributions of $(Y,Z)$, $(T,Z)$ and $(C,Z)$ under
$Q \in \mathcal{Q}$.
Then the conditional distribution and conditional subdistributions of $Y$
given $Z$, $W(y|z)$, $W_1(y|z)$ and $W_2(y|z)$, have the following forms:

$$\begin{aligned}
W(y|z) &= \int_0^y (1 - G(t|z))\,dF(t|z) + \int_0^y (1 - F(t|z))\,dG(t|z) \\
&= 1 - (1 - F(y|z))(1 - G(y|z)),
\end{aligned}$$

$$W_1(y|z) = \int_0^y (1 - G(t|z))\,dF(t|z), \qquad W_2(y|z) = \int_0^y (1 - F(t|z))\,dG(t|z).$$

Now we define two operators $\mathbf{R}_1$ and $\mathbf{R}_2$ as follows:

$$\mathbf{R}_1 b(y,z) = b(y,1,z) - \frac{\int_y^\infty b(t,1,z)\,dW_1(t|z) + b(t,0,z)\,dW_2(t|z)}{1 - W(y|z)}, \tag{3.7}$$

$$\mathbf{R}_2 b(y,z) = b(y,0,z) - \frac{\int_y^\infty b(t,1,z)\,dW_1(t|z) + b(t,0,z)\,dW_2(t|z)}{1 - W(y|z)}. \tag{3.8}$$

They can be rewritten as

$$\mathbf{R}_1 b(y,z) = b(y,1,z) - E[b(Y,\Delta,Z)|Y > y, Z = z], \tag{3.9}$$

$$\mathbf{R}_2 b(y,z) = b(y,0,z) - E[b(Y,\Delta,Z)|Y > y, Z = z]. \tag{3.10}$$

We will show later in Proposition 3.1 that $\mathbf{R}_1$ and $\mathbf{R}_2$ map
$L_2^0(Q)$ to $L_2(W_1)$ and $L_2(W_2)$, respectively. These operators are
similar to the $\mathbf{R}$ operator discussed by Ritov and Wellner (1988),
Efron and Johnstone (1990) and Bickel, Klaassen, Ritov and Wellner (1993).

The following Proposition 3.1 plays a key role in characterizing the space
$\dot{\mathcal{Q}}_\eta^\perp$ which will be used to derive the efficient score
$l_\theta^*$ for $\theta$ in the model $\mathcal{P}$ in the next subsection.

Proposition 3.1. Any function $b \in L_2^0(Q)$ can be decomposed as follows:

$$b(Y,\Delta,Z) = \int \mathbf{R}_1 b(t,Z)\,dM(t) + \int \mathbf{R}_2 b(t,Z)\,dM_G(t) + E(b|Z), \tag{3.11}$$

where

$$\mathbf{R}_1 b(Y,Z) \in L_2(W_1), \qquad \mathbf{R}_2 b(Y,Z) \in L_2(W_2). \tag{3.12}$$

The decomposition is unique in the sense that $\mathbf{R}_1 b$ is unique a.e.
$W_1$ and $\mathbf{R}_2 b$ is unique a.e. $W_2$.
To prove Proposition 3.1, we will use the following two lemmas. 
Lemma 3.1. For the failure counting process martingale $\{M(t) : t \ge 0\}$
defined by (3.5), we have

$$\int h_1(t,Z)\,dM(t) \in L_2^0(Q) \quad\text{if and only if}\quad h_1(Y,Z) \in L_2(W_1), \tag{3.13}$$

and similarly for the censoring counting process martingale,

$$\int h_2(t,Z)\,dM_G(t) \in L_2^0(Q) \quad\text{if and only if}\quad h_2(Y,Z) \in L_2(W_2). \tag{3.14}$$

Proof. By using the methods of Chapter 6 of Shorack and Wellner
(1986) we can easily show that

$$E\Big(\int h_1(t,Z)\,dM(t)\Big)^2 = \int\!\!\int h_1^2(t,s)\,dW_1(t,s),$$

$$E\Big(\int h_2(t,Z)\,dM_G(t)\Big)^2 = \int\!\!\int h_2^2(t,s)\,dW_2(t,s).$$

The zero means are trivial. □

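Lemma 3.2. For any functions $h_j : \mathbb{R}^+ \times \mathbb{R}^d \to \mathbb{R}$
in $L_2(W_1)$, $j = 1,2$,

$$E\Big[\int h_1(t,Z)\,dM(t)\int h_2(t,Z)\,dM(t)\Big] = E[\Delta h_1(Y,Z)h_2(Y,Z)].$$

Similarly, for any functions $h_j : \mathbb{R}^+ \times \mathbb{R}^d \to \mathbb{R}$
in $L_2(W_2)$, $j = 1,2$,

$$E\Big[\int h_1(t,Z)\,dM_G(t)\int h_2(t,Z)\,dM_G(t)\Big] = E[(1-\Delta)h_1(Y,Z)h_2(Y,Z)].$$

Proof. This follows from Lemma 1 of Sasieni (1992a). □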

Proof of Proposition 3.1. Equation (3.11) can be verified directly
by the definitions of the $\mathbf{R}_1$ and $\mathbf{R}_2$ operators in (3.7)
and (3.8). See Nan (2001) for details. By Lemma 3.1 we know that the right-hand
side of (3.11) is in $L_2^0(Q)$ if (3.12) is true. Now we show (3.12). Let

$$m(y,z) = \frac{\int_y^\infty b(t,1,z)\,dW_1(t|z) + b(t,0,z)\,dW_2(t|z)}{1 - W(y|z)}.$$

Obviously, $b(Y,1,Z) \in L_2(W_1)$ and $b(Y,0,Z) \in L_2(W_2)$ since

$$\begin{aligned}
E[b^2(Y,\Delta,Z)] &= E[\Delta b^2(Y,1,Z) + (1-\Delta)b^2(Y,0,Z)] \\
&= \int\!\!\int b^2(y,1,z)\,dW_1(y,z) + \int\!\!\int b^2(y,0,z)\,dW_2(y,z) < \infty.
\end{aligned}$$

Thus we only need to show that $m(Y,Z) \in L_2(W)$. We rewrite $m(Y,Z)$ as

$$m(Y,Z) = \frac{1}{1 - W(Y|Z)}\int_Y^\infty \{b(t,1,Z)\alpha(t,Z) + b(t,0,Z)\beta(t,Z)\}\,dW(t|Z),$$

where $\alpha(t,z) = dW_1(t|z)/dW(t|z)$ and $\beta(t,z) = dW_2(t|z)/dW(t|z)$.
It is obvious that $0 \le \alpha \le 1$ and $0 \le \beta \le 1$. Then by the
same argument as that on page 423 of Bickel, Klaassen, Ritov and Wellner (1993),
we have

$$\begin{aligned}
E[m^2(Y,Z)|Z] &\le 8\Big\{\int b^2(t,1,Z)\alpha(t,Z)\,dW_1(t|Z) + \int b^2(t,0,Z)\beta(t,Z)\,dW_2(t|Z)\Big\} \\
&\le 8\Big\{\int b^2(t,1,Z)\,dW_1(t|Z) + \int b^2(t,0,Z)\,dW_2(t|Z)\Big\}.
\end{aligned}$$

Then by Fubini's theorem we have

$$E[m^2(Y,Z)] \le 8\Big\{\int\!\!\int b^2(t,1,s)\,dW_1(t,s) + \int\!\!\int b^2(t,0,s)\,dW_2(t,s)\Big\} < \infty.$$

Actually, from the above proof we have also shown that

$$L_2^0(Q) = \Big\{h_0(Z) + \int h_1(t,Z)\,dM(t) + \int h_2(t,Z)\,dM_G(t) : h_0 \in L_2^0(H),\ h_1 \in L_2(W_1),\ h_2 \in L_2(W_2)\Big\}.$$

The uniqueness can be proved by showing that any two decompositions
of an element in $L_2^0(Q)$ are identical. Suppose we have

$$b = \int h_1(t,Z)\,dM(t) + \int h_2(t,Z)\,dM_G(t) + h_0(Z),$$

$$b = \int h_1'(t,Z)\,dM(t) + \int h_2'(t,Z)\,dM_G(t) + h_0'(Z).$$

Taking the expectation of the square of the difference of the right-hand sides
of the two equalities and using the orthogonality of $M$, $M_G$ and any function
of $Z$, by Lemma 3.2 we have

$$0 = \int (h_1 - h_1')^2\,dW_1 + \int (h_2 - h_2')^2\,dW_2 + \int (h_0 - h_0')^2\,dH.$$

Thus $h_0' = h_0$ a.s. $H$, $h_1' = h_1$ a.e. $W_1$ and $h_2' = h_2$ a.e. $W_2$. □
Now we are ready to discuss the space Q^. 
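Proposition 3.2. For any function $s(Y,Z) \in L_2(W_1)$ define the operator
$\mathbf{B}$ by

$$\mathbf{B}s = \int \{s(t,Z) - E[s(Y,Z)|Y = t, \Delta = 1]\}\,dM(t). \tag{3.15}$$

Then:

(i) $\mathbf{B}s \perp \dot{\mathcal{Q}}_\eta$ in $L_2^0(Q)$.

(ii) For any $b \in L_2^0(Q)$ we have $\Pi(b|\dot{\mathcal{Q}}_\eta^\perp) = \mathbf{B} \circ \mathbf{R}_1 b$.

(iii) $\dot{\mathcal{Q}}_\eta^\perp = (\dot{\mathcal{Q}}_2 + \dot{\mathcal{Q}}_3 + \dot{\mathcal{Q}}_4)^\perp = \{\mathbf{B}s : s \in L_2(W_1)\}$.

Proof. (i) Since $\dot{\mathcal{Q}}_2$, $\dot{\mathcal{Q}}_3$ and
$\dot{\mathcal{Q}}_4$ are mutually orthogonal and $M \perp M_G$, we have

$$\Pi(\mathbf{B}s|\dot{\mathcal{Q}}_2 + \dot{\mathcal{Q}}_3 + \dot{\mathcal{Q}}_4) = \Pi(\mathbf{B}s|\dot{\mathcal{Q}}_2) + \Pi(\mathbf{B}s|\dot{\mathcal{Q}}_3) + \Pi(\mathbf{B}s|\dot{\mathcal{Q}}_4) = \Pi(\mathbf{B}s|\dot{\mathcal{Q}}_2).$$

Let $m(Y,\Delta,Z) = \int s(t,Z)\,dM(t)$. Then
$\Pi(m|\dot{\mathcal{Q}}_2) = \int a^*(t)\,dM(t)$ for some
$a^*(Y) \in L_2(W_1)$ satisfying

$$E\Big[\int \{s(t,Z) - a^*(t)\}\,dM(t)\int a(t)\,dM(t)\Big] = 0$$

for any $a(Y) \in L_2(W_1)$. Now by Lemma 3.2 the left-hand side of the above
equation is equal to

$$E[\Delta\{s(Y,Z) - a^*(Y)\}a(Y)] = E[\{E[\Delta s(Y,Z)|Y] - a^*(Y)E[\Delta|Y]\}a(Y)]$$

and hence

$$a^*(Y) = \frac{E[\Delta s(Y,Z)|Y]}{E[\Delta|Y]} = E[s(Y,Z)|Y, \Delta = 1].$$

So $\mathbf{B}s \perp \dot{\mathcal{Q}}_2$, and this yields
$\mathbf{B}s \perp \dot{\mathcal{Q}}_2 + \dot{\mathcal{Q}}_3 + \dot{\mathcal{Q}}_4 = \dot{\mathcal{Q}}_\eta$.

(ii) From Proposition 3.1 we know that for any $b \in L_2^0(Q)$ we have the
decomposition (3.11). Hence, from the proof of part (i) we know that

$$\begin{aligned}
\Pi(b|\dot{\mathcal{Q}}_2 + \dot{\mathcal{Q}}_3 + \dot{\mathcal{Q}}_4)
&= \Pi(b|\dot{\mathcal{Q}}_2) + \Pi(b|\dot{\mathcal{Q}}_3) + \Pi(b|\dot{\mathcal{Q}}_4) \\
&= \int E[\mathbf{R}_1 b(Y,Z)|Y = t, \Delta = 1]\,dM(t) + \int \mathbf{R}_2 b(t,Z)\,dM_G(t) + E(b|Z).
\end{aligned}$$

Thus,

$$\Pi(b|(\dot{\mathcal{Q}}_2 + \dot{\mathcal{Q}}_3 + \dot{\mathcal{Q}}_4)^\perp) = b - \Pi(b|\dot{\mathcal{Q}}_2 + \dot{\mathcal{Q}}_3 + \dot{\mathcal{Q}}_4) = \mathbf{B} \circ \mathbf{R}_1 b.$$

(iii) This is an immediate consequence of (i) and (ii). □

If we choose $s(t,Z) = Z$, then $\mathbf{B}s$ is the efficient score for
$\theta$ in the ("full" or "complete" data) model $\mathcal{Q}$.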
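
Written out, the choice $s(t,Z) = Z$ gives the familiar efficient score for the Cox model with complete data. The second display below is a sketch that assumes continuous failure and censoring distributions and conditional independence of $T$ and $C$ given $Z$, so that the sub-density of $(Y = t, \Delta = 1, Z = z)$ is $\lambda(t)e^{\theta' z}P(Y \ge t|Z = z)h(z)$.

```latex
\begin{align*}
l_\theta^{*0} = \mathbf{B}Z
  &= \int \{Z - E[Z \mid Y = t, \Delta = 1]\}\,dM(t), \\
E[Z \mid Y = t, \Delta = 1]
  &= \frac{\int z\,e^{\theta' z}\,P(Y \ge t \mid Z = z)\,h(z)\,dz}
          {\int e^{\theta' z}\,P(Y \ge t \mid Z = z)\,h(z)\,dz}
   = \frac{E[Z\,e^{\theta' Z}\,1(Y \ge t)]}{E[e^{\theta' Z}\,1(Y \ge t)]}.
\end{align*}
```

The ratio on the right is the population version of the risk-set weighted covariate average that appears in the partial likelihood score.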
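3.2. Efficient score for $\theta$ in observed model $\mathcal{P}$. In this
subsection we will use the results in Section 2 and the previous subsection to
derive the efficient score $l_\theta^*$ of $\mathcal{P}$. Since from Proposition
3.2 we know that $\dot{\mathcal{Q}}_\eta^\perp = \{\mathbf{B}s,\ s \in L_2(W_1)\}$,
for the model $\mathcal{P}$, with $P \in \mathcal{P}$, we define the class
$\mathcal{K}$ of all functions with the form

$$k(Y,\Delta,RX,V,R) = \frac{R}{\pi}\mathbf{B}(s(Y,X,V)) - \frac{R - \pi}{\pi}E[\mathbf{B}(s(Y,X,V))|Y,\Delta,V],$$

where $s \in L_2(W_1)$. Note that we can rewrite the functions $k$ in terms of
the operator $\mathbf{D} : L_2(W_1) \to L_2^0(Q)$ defined by

$$\mathbf{D}u(Y,\Delta,Z) = \int u(t,Z)\,dM(t). \tag{3.16}$$

Thus $\mathbf{B}s = \mathbf{D} \circ \Pi_1 s = \mathbf{D}u$, where

$$u(Y,Z) = \Pi_1 s = s(Y,Z) - E[s(Y,Z)|Y, \Delta = 1]. \tag{3.17}$$

Then we have the following proposition:

Proposition 3.3. The efficient score $l_\theta^*$ for $\theta$ in the model
$\mathcal{P}$ for the observed data is given by

$$l_\theta^*(R,Y,\Delta,R\cdot X,V) = k^* = \frac{R}{\pi}\mathbf{D}u^*(Y,\Delta,Z) - \frac{R - \pi}{\pi}E[\mathbf{D}u^*(Y,\Delta,Z)|Y,\Delta,V],$$

where $u^* \in (L_2(W_1))^d$ is the unique a.e. $W_1$ solution of the equation

$$\Pi_1 Z = \Pi_1 \circ \mathbf{R}_1\Big\{\mathbf{D}u^* + \Big(\frac{1 - \pi}{\pi}\Big)[\mathbf{D}u^* - E(\mathbf{D}u^*|Y,\Delta,V)]\Big\} = u^* + \mathbf{T}u^*, \tag{3.18}$$

where

$$\mathbf{T} \equiv \Pi_1 \circ \mathbf{R}_1 \circ \mathbf{H} \tag{3.19}$$

and

$$\mathbf{H}u^* = \Big(\frac{1 - \pi}{\pi}\Big)[\mathbf{D}u^* - E(\mathbf{D}u^*|Y,\Delta,V)]. \tag{3.20}$$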
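
As a check on Proposition 3.3, it is instructive to specialize to the case of no missing data. The following sketch assumes $\pi \equiv 1$ (so that $R = 1$ with probability one); this degenerate case is used only for illustration.

```latex
% Assumption for this sketch: \pi \equiv 1, hence (1-\pi)/\pi = 0 and R = 1 a.s.
\begin{align*}
\mathbf{H}u^* &= 0, \qquad \mathbf{T} = \Pi_1 \circ \mathbf{R}_1 \circ \mathbf{H} = 0, \\
u^* &= \Pi_1 Z = Z - E[Z \mid Y, \Delta = 1]
  && \text{(equation (3.18))}, \\
l_\theta^* &= \mathbf{D}u^* = \mathbf{D} \circ \Pi_1 Z = \mathbf{B}Z = l_\theta^{*0}.
\end{align*}
```

So the integral equation collapses and the observed-data efficient score reduces to the full-data efficient score. For $\pi < 1$ the operator $\mathbf{T}$ measures how far the solution $u^*$ moves away from $\Pi_1 Z$.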

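Corollary 3.1. The function $u^*$ in Proposition 3.3 also satisfies,
equivalently,

$$u^*(Y,Z) - \mathbf{K}u^*(Y,Z) + \pi(Y,1,V)\frac{E[\mathbf{K}u^*|Y,\Delta = 1]}{E[\pi(Y,1,V)|Y,\Delta = 1]} = \pi(Y,1,V)\{Z - E[Z|Y, R\Delta = 1]\}, \tag{3.21}$$

where the operator $\mathbf{K}$ is a linear operator defined by

$$\begin{aligned}
\mathbf{K}u^*(Y,Z) ={}& -E[\mathbf{D}u^*(Y',\Delta,Z)\,|\,Y' > Y, Z] \\
& + \pi(Y,1,V)\,E\Big[\frac{\mathbf{D}u^*(Y',\Delta,Z)}{\pi(Y',\Delta,V)}\,\Big|\,Y' > Y, Z\Big] \\
& + (1 - \pi(Y,1,V))\,E[\mathbf{D}u^*|Y,\Delta,V]\big|_{\Delta = 1} \\
& - \pi(Y,1,V)\,E\Big[\frac{1 - \pi(Y',\Delta,V)}{\pi(Y',\Delta,V)}\,E[\mathbf{D}u^*(Y',\Delta,Z)|Y',\Delta,V]\,\Big|\,Y' > Y, Z\Big].
\end{aligned} \tag{3.22}$$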

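Our proof of Proposition 3.3 will use the following lemma.

Lemma 3.3. Denote the adjoint of $\mathbf{D}$ by
$\mathbf{D}^T : L_2^0(Q) \to L_2(W_1)$. Let $\Pi_1$ be defined as in (3.17). Then:

(i) $\mathbf{D}^T = \mathbf{R}_1$.
(ii) $\mathbf{R}_1 \circ \mathbf{D} = \mathbf{I}$.
(iii) $\Pi_1$ is a projection operator on $L_2(W_1)$.

Proof. (i) Let $a \in L_2(W_1)$, $b \in L_2^0(Q)$. Then we have
$\langle a, \mathbf{D}^T b\rangle_{L_2(W_1)} = \langle \mathbf{D}a, b\rangle_{L_2^0(Q)}$.
By Proposition 3.1,

$$b(Y,\Delta,Z) = \int \mathbf{R}_1 b(t,Z)\,dM(t) + \int \mathbf{R}_2 b(t,Z)\,dM_G(t) + E[b(Y,\Delta,Z)|Z].$$

Then we have

$$\begin{aligned}
\langle \mathbf{D}a, b\rangle_{L_2^0(Q)} &= E_Q\Big\{\int a(t,Z)\,dM(t)\int \mathbf{R}_1 b(t,Z)\,dM(t)\Big\} \\
&= E_Q\{\Delta a(Y,Z)\mathbf{R}_1 b(Y,Z)\} \\
&= \int\!\!\int a(t,z)\mathbf{R}_1 b(t,z)\,dW_1(t,z) = \langle a, \mathbf{R}_1 b\rangle_{L_2(W_1)}.
\end{aligned}$$

(ii) For all $h \in L_2(W_1)$, by Proposition 3.1 we have
$\mathbf{D}h = \int h\,dM = \int \mathbf{R}_1 \circ \mathbf{D}h\,dM$ and thus
$h = \mathbf{R}_1 \circ \mathbf{D}h$ a.e. $W_1$.

(iii) That $\Pi_2(\cdot) = E(\cdot|Y, \Delta = 1)$ is a projection operator on
$L_2(W_1)$ can be shown by checking the three properties in Proposition A.2.2 of
Bickel, Klaassen, Ritov and Wellner (1993). So is $\Pi_1 = \mathbf{I} - \Pi_2$. □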
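
Part (ii) of Lemma 3.3 can also be seen by a direct calculation from (3.9). The sketch below assumes continuous distributions, $h \in L_2(W_1)$ and the at-risk form of the martingale $M$; it is meant as an illustration of why $\mathbf{R}_1$ undoes $\mathbf{D}$, not as a replacement for the proof.

```latex
% On the event {Y > y} no failure has occurred by time y, so the integral of h dM over (0, y]
% equals minus the compensator; the increment of the martingale after y has conditional mean zero.
\begin{align*}
\mathbf{D}h(y,\delta,z) &= \delta\,h(y,z) - \int_0^y h(t,z)\,d\Lambda(t|z), \\
E[\mathbf{D}h(Y,\Delta,Z) \mid Y > y, Z = z]
  &= -\int_0^y h(t,z)\,d\Lambda(t|z)
     + E\Big[\int_{(y,\infty)} h(t,Z)\,dM(t)\,\Big|\,Y > y, Z = z\Big] \\
  &= -\int_0^y h(t,z)\,d\Lambda(t|z), \\
\mathbf{R}_1\mathbf{D}h(y,z)
  &= \mathbf{D}h(y,1,z) - E[\mathbf{D}h(Y,\Delta,Z) \mid Y > y, Z = z] = h(y,z).
\end{align*}
```

The same steps with $\delta = 0$ give $\mathbf{R}_2\mathbf{D}h = 0$, consistent with the uniqueness statement in Proposition 3.1.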
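Proof of Proposition 3.3. We use Proposition 2.2 directly to prove
Proposition 3.3. For the Cox regression model we have
$l_\theta^{*0} = \mathbf{B}Z = \mathbf{D} \circ \Pi_1 Z$ and
$\zeta^* = \mathbf{D}u^*$. Let

$$b(Y,\Delta,Z) = \mathbf{B}Z - \frac{1}{\pi}\mathbf{D}u^* + \frac{1 - \pi}{\pi}E[\mathbf{D}u^*|Y,\Delta,V].$$

Then by Proposition 2.2 and Proposition 3.2(ii) we have
$\Pi(b|\dot{\mathcal{Q}}_\eta^\perp) = \mathbf{B} \circ \mathbf{R}_1 b = 0$.
Since by Lemma 3.3 we have
$\mathbf{B} \circ \mathbf{R}_1 b = \mathbf{D} \circ \Pi_1 \circ \mathbf{R}_1 b = \mathbf{D} \circ \Pi_1 \mathbf{D}^T b = \mathbf{D} \circ (\mathbf{D} \circ \Pi_1)^T b = \mathbf{D} \circ \mathbf{B}^T b$
and $\mathbf{R}_1 \circ \mathbf{D} = \mathbf{I}$,
$\mathbf{B} \circ \mathbf{R}_1 b = 0$ implies
$\mathbf{R}_1 \circ \mathbf{B} \circ \mathbf{R}_1 b = \mathbf{R}_1 \circ \mathbf{D} \circ \mathbf{B}^T b = \mathbf{B}^T b = 0$,
which is

$$\mathbf{B}^T\Big(\mathbf{B}Z - \frac{1}{\pi}\mathbf{D}u^* + \frac{1 - \pi}{\pi}E[\mathbf{D}u^*|Y,\Delta,V]\Big) = 0. \tag{3.23}$$

Since $u^*(Y,Z) = s^*(Y,Z) - E[s^*(Y,Z)|Y,\Delta = 1] \in L_2(W_1)$ satisfies

$$E[u^*(Y,Z)|Y,\Delta = 1] = 0, \tag{3.24}$$

we must solve the pair of equations (3.23) and (3.24) for the function $u^*$.
Note that $\Pi_1$ is a projection operator and thus we have

$$\begin{aligned}
\mathbf{B}^T \circ \mathbf{B}Z &= \Pi_1^T \circ \mathbf{D}^T \circ \mathbf{B}Z \\
&= \Pi_1 \circ \mathbf{R}_1 \circ \mathbf{D} \circ \Pi_1 Z = \Pi_1^2 Z = \Pi_1 Z = Z - E(Z|Y,\Delta = 1).
\end{aligned}$$

Hence (3.23) can be rewritten, using Lemma 3.3(ii) and (iii), as

$$\begin{aligned}
Z - E[Z|Y,\Delta = 1]
&= \Pi_1 \circ \mathbf{R}_1\Big\{\frac{\mathbf{D}u^*}{\pi} - \frac{1 - \pi}{\pi}E(\mathbf{D}u^*|Y,\Delta,V)\Big\} \\
&= \Pi_1 \circ \mathbf{R}_1\Big\{\mathbf{D}u^* + \frac{1 - \pi}{\pi}\big(\mathbf{D}u^* - E(\mathbf{D}u^*|Y,\Delta,V)\big)\Big\} \\
&= u^* + \mathbf{T}u^*,
\end{aligned} \tag{3.25}$$

where $\mathbf{T} = \Pi_1 \circ \mathbf{R}_1 \circ \mathbf{H}$ and $\mathbf{H}$
is given by (3.20). Thus (3.18) holds.

To see that the solution is unique, we argue as follows: let
$\zeta^* = \mathbf{D}u^*$. Then from Proposition 2.2 we know that
$\zeta^* \in (\dot{\mathcal{Q}}_\eta^\perp)^d$ is the unique solution of the
operator equation

$$\Pi\Big(\frac{1}{\pi}\zeta - \frac{1 - \pi}{\pi}E[\zeta|Y,\Delta,V]\,\Big|\,\dot{\mathcal{Q}}_\eta^\perp\Big) = l_\theta^{*0}.$$

Suppose we have

$$\zeta^* = \mathbf{D}u_1^* = \int u_1^*(t,Z)\,dM(t) \quad\text{and}\quad \zeta^* = \mathbf{D}u_2^* = \int u_2^*(t,Z)\,dM(t).$$

Then taking the expectation of the square of the difference of the two
equalities componentwise and using Lemma 3.2 (componentwise), we have

$$0 = \int |u_1^* - u_2^*|^2\,dW_1.$$

It follows that $u_1^* = u_2^*$ a.e. $W_1$. □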

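Proof of Corollary 3.1. It also follows from (3.25) that

\[
\begin{aligned}
Z - E[Z \mid Y, \Delta = 1]
&= \Pi_1 \circ R_1\Big\{\frac{Du^*}{\pi} - \frac{1-\pi}{\pi} E(Du^* \mid Y,\Delta,V)\Big\} && (3.26) \\
&= R_1\Big\{\frac{Du^*}{\pi} - \frac{1-\pi}{\pi} E(Du^* \mid Y,\Delta,V)\Big\}
 - E\Big[R_1\Big\{\frac{Du^*}{\pi} - \frac{1-\pi}{\pi} E(Du^* \mid Y,\Delta,V)\Big\} \,\Big|\, Y, \Delta = 1\Big] \\
&\equiv R_1\Big\{\frac{Du^*}{\pi} - \frac{1-\pi}{\pi} E(Du^* \mid Y,\Delta,V)\Big\} - f_{u^*}(Y). && (3.27)
\end{aligned}
\]

Now

\[ R_1\Big(\frac{Du^*}{\pi}\Big) = \frac{Du^*|_{\Delta=1}}{\pi(Y,1,V)}
 - E\Big(\frac{Du^*(Y',\Delta,Z)}{\pi(Y',\Delta,V)} \,\Big|\, Y' > Y, Z\Big) \tag{3.28} \]

and

\[ R_1\Big(\frac{1-\pi}{\pi} E[Du^* \mid Y,\Delta,V]\Big)
 = \frac{1-\pi(Y,1,V)}{\pi(Y,1,V)} E[Du^* \mid Y,\Delta,V]\big|_{\Delta=1}
 - E\Big(\frac{1-\pi(Y',\Delta,V)}{\pi(Y',\Delta,V)} E[Du^*(Y',\Delta,Z) \mid Y',\Delta,V] \,\Big|\, Y' > Y, Z\Big). \tag{3.29} \]

Substituting (3.28) and (3.29) into (3.27) yields

\[
\begin{aligned}
Z - E[Z \mid Y, \Delta = 1] + f_{u^*}(Y)
&= \frac{Du^*|_{\Delta=1}}{\pi(Y,1,V)}
 - E\Big(\frac{Du^*(Y',\Delta,Z)}{\pi(Y',\Delta,V)} \,\Big|\, Y' > Y, Z\Big) \\
&\quad - \frac{1-\pi(Y,1,V)}{\pi(Y,1,V)} E[Du^* \mid Y,\Delta,V]\big|_{\Delta=1} \\
&\quad + E\Big(\frac{1-\pi(Y',\Delta,V)}{\pi(Y',\Delta,V)} E[Du^* \mid Y',\Delta,V] \,\Big|\, Y' > Y, Z\Big) \\
&= \frac{1}{\pi(Y,1,V)}\big(u^*(Y,Z) - Ku^*(Y,Z)\big).
\end{aligned}
\tag{3.30}
\]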


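Here we use

\[ Du^*|_{\Delta=1} = R_1 \circ Du^*(Y,Z) + E[Du^*(Y',\Delta,Z) \mid Y' > Y, Z], \]

and the operator $K$ is defined as in (3.22). Then we have

\[ u^*(Y,Z) - Ku^*(Y,Z) = \pi(Y,1,V)\{Z - E(Z \mid Y, \Delta = 1) + f_{u^*}(Y)\}. \tag{3.31} \]

Thus, taking conditional expectations given $Y, \Delta = 1$ and using
$E(u^* \mid Y, \Delta = 1) = 0$ as required by (3.24),

\[ -E[Ku^* \mid Y, \Delta = 1] = E[\pi(Y,1,V) \mid Y, \Delta = 1]\, f_{u^*}(Y)
 + E[\pi(Y,1,V)(Z - E(Z \mid Y, \Delta = 1)) \mid Y, \Delta = 1]. \]

Solving this for $f_{u^*}(Y)$ yields

\[ f_{u^*}(Y) = -\frac{E[Ku^* \mid Y, \Delta = 1]}{E[\pi(Y,1,V) \mid Y, \Delta = 1]}
 - \frac{E[\pi(Y,1,V)(Z - E(Z \mid Y, \Delta = 1)) \mid Y, \Delta = 1]}{E[\pi(Y,1,V) \mid Y, \Delta = 1]}. \tag{3.32} \]

Note that

\[
\begin{aligned}
\frac{E[\pi(Y,1,V)(Z - E(Z \mid Y, \Delta = 1)) \mid Y, \Delta = 1]}{E[\pi(Y,1,V) \mid Y, \Delta = 1]}
&= \frac{E[\pi(Y,1,V) Z \mid Y, \Delta = 1]}{E[\pi(Y,1,V) \mid Y, \Delta = 1]} - E[Z \mid Y, \Delta = 1] \\
&= E[Z \mid Y, \Delta = 1, R = 1] - E[Z \mid Y, \Delta = 1].
\end{aligned}
\tag{3.33}
\]

Combining (3.32) and (3.33) with (3.31), we obtain (3.21). □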
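The identity (3.33) is the step where the missing-at-random assumption enters, and it can be verified exactly on a small finite example. The sketch below works within a single stratum $\{Y = y, \Delta = 1\}$ with a binary $Z$ and a binary $V$; the joint probabilities in `pmf` and the selection probabilities in `pi` are hypothetical numbers chosen only for illustration, and `expect` is a helper introduced here.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (Z, V) within one stratum {Y = y, Delta = 1}.
pmf = {(0, 0): F(3, 10), (1, 0): F(1, 10), (0, 1): F(1, 5), (1, 1): F(2, 5)}
# Hypothetical selection probabilities pi(y, 1, v) = P(R = 1 | Y = y, Delta = 1, V = v).
pi = {0: F(1, 4), 1: F(3, 4)}


def expect(g):
    """Expectation of g(Z, V) under the stratum pmf."""
    return sum(p * g(z, v) for (z, v), p in pmf.items())


mean_z = expect(lambda z, v: z)

# Left-hand side of (3.33): E[pi (Z - E Z)] / E[pi].
lhs = expect(lambda z, v: pi[v] * (z - mean_z)) / expect(lambda z, v: pi[v])

# Right-hand side: E[Z | R = 1] - E[Z], with E[Z | R = 1] computed from the
# explicit joint law of (Z, V, R = 1), where R depends on (Z, V) only through V.
joint_r1 = {(z, v): p * pi[v] for (z, v), p in pmf.items()}
prob_r1 = sum(joint_r1.values())
mean_z_given_r1 = sum(z * p for (z, v), p in joint_r1.items()) / prob_r1
rhs = mean_z_given_r1 - mean_z

print(lhs, rhs)
```

Exact rational arithmetic is used so that the two sides can be compared with equality rather than a tolerance.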

The reason that we solve for $u^*$ instead of $s^*$ is that there is no unique
solution if we solve for $s^*$: if $s^*(Y,Z)$ satisfies (3.23) with $u^*$ replaced by
$s^* - E[s^* \mid Y, \Delta = 1]$, then $s^*(Y,Z) + f(Y)$ will satisfy (3.23) for any
function $f(Y)$. From the form of $k \in \mathcal{K}$ we know that $k$ is determined by
$u(Y,Z) = s(Y,Z) - E[s(Y,Z) \mid Y, \Delta = 1]$. So we only need to solve for
$u^*(Y,Z) = s^*(Y,Z) - E[s^*(Y,Z) \mid Y, \Delta = 1]$ and then compute $\ell_\theta^* = k^*$.

The parts of (3.18) and (3.21) involving T and K, respectively, can be 
viewed as integral operators on the unknown function u* . But these equa- 
tions are not standard Fredholm integral equations of the second kind [see, 
e.g., Kress (1999) and Rudin (1973), Chapter 4]. Note that the terms of K 
involving conditional expectation given $Y' > Y$ correspond to noncompact
operators [see, e.g., Rudin (1973), Problem 17, page 107] in general.

Remark. The equation corresponding to our (3.18) given by Robins, Rotnitzky and Zhao 
[(1994) Section 8.3, page 862, column 1] is incorrect. According to personal 
communications with Robins, their incorrect result is due to algebraic errors. 
3.3. Alternative models. The survival model we have discussed so far
assumes that Y is always observable and V is also a vector of covariates of 
interest. We now consider some alternative scenarios for the models involved 
in the two-phase designs. 

When $Z = (X,V)$ but the Cox model only involves $X$. When $Z = (X,V)$,
but the Cox model for the conditional distribution of $T$ given $Z$ includes
only $X$, then $\theta'z$ is replaced by $\theta'x$ in the model (2.1). After going through
the same procedure as deriving equation (3.18), we obtain

\[ u^*(Y,Z) - Ku^*(Y,Z) = \pi(Y,1,V)\{X - E[X \mid Y, R\Delta = 1]\}, \]

where the operator $K$ is as defined in (3.22).

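When $Y$ is not observed at phase 1. If we are not able to observe $Y$ in
the first phase, we will observe the data

\[ (Y,\Delta,X,V)^{R}\,(\Delta,V)^{1-R} \]

and the efficient score will have the form

\[ k^* = \frac{R}{\pi} Du^*(Y,X,V) - \frac{R-\pi}{\pi} E[Du^*(Y,X,V) \mid \Delta, V]. \]

By the same method used to derive (3.18), we find, for estimation of the
coefficient of $Z$, that $u^*$ satisfies

\[ u^*(Y,Z) - \Big\{Ku^*(Y,Z) - \frac{\pi(1,V)}{E[\pi(1,V) \mid Y, \Delta = 1]} E[Ku^*(Y,Z) \mid Y, \Delta = 1]\Big\}
 = \pi(1,V)\{Z - E[Z \mid Y, R\Delta = 1]\}, \tag{3.35} \]

where

\[
\begin{aligned}
Ku^*(Y,Z) &= -E[Du^*(Y',\Delta,Z) \mid Y' > Y, Z] + (1 - \pi(1,V))\, E[Du^* \mid \Delta, V]\big|_{\Delta=1} \\
&\quad + \pi(1,V)\, E\Big(\frac{Du^*(Y',\Delta,Z)}{\pi(\Delta,V)} - \frac{1-\pi(\Delta,V)}{\pi(\Delta,V)} E[Du^*(Y',\Delta,Z) \mid \Delta, V] \,\Big|\, Y' > Y, Z\Big).
\end{aligned}
\tag{3.36}
\]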
When $Y$ is not observed at phase 1, $Z = (X,V)$, and the Cox model only
involves $X$. When the Cox model involves only $\theta'X$ and just $(\Delta,V)$ is
observed in the first phase, $u^*$ must satisfy

\[ u^*(Y,Z) - \Big\{Ku^*(Y,Z) - \frac{\pi(1,V)}{E[\pi(1,V) \mid Y, \Delta = 1]} E[Ku^*(Y,Z) \mid Y, \Delta = 1]\Big\}
 = \pi(1,V)\{X - E[X \mid Y, R\Delta = 1]\}, \tag{3.37} \]

with $K$ defined in (3.36).

When $Z = (X,V)$, $V = (V_1, V_2)$ and the Cox model only involves $(X,V_1)$.
Suppose that $V = (V_1, V_2)$ and that $T$ and $V_2$ are conditionally independent
given $(X,V_1)$, that is, the Cox model for the conditional distribution of $T$
involves only $(X,V_1)$. This is a generalization of the previous models. In
order to avoid repeating calculation of the efficient score function, we can
reduce the general model to the model either in Section 3.1 or in Section 3.3
and apply those existing results. Let $\tilde{X} = (X,V_1)$ and assume that $\tilde{X}$ is
missing at random. Thus $\tilde{X}$ and $V$ would, respectively, play the role of
$X$ and $V$ in our Sections 3.1 and 3.2. Although $V_1$ can be missing in $\tilde{X}$, it is
fully recovered from $V$. Then we can directly use the results of alternative
models in Sections 3.1 and 3.3 for the general model.

4. Examples of information bound calculations. The case-cohort design, 
studied by Prentice (1986) and Self and Prentice (1988), and the exposure 
stratified case-cohort design, studied by Borgan, Langholz, Samuelsen, Goldstein and Pogoda 
(2000), are two special cases in the class of two-phase designs. In the case- 
cohort design the complete information is essentially observed for all the 
failures and a simple random subsample of the nonfailures. The exposure 
stratified case-cohort design is a modification of the classical case-cohort de- 
sign in which complete covariate data is observed for all failures and for a 
stratified random subsample of the nonfailures. The stratification is based 
upon a correlate (or surrogate variable, available for everyone) of the true 
exposure (or prognostic factor) of interest. In this section we treat the sim- 
plified i.i.d. versions of these sampling designs. 

Pseudo-likelihood type (inefficient) estimators have been proposed by
Prentice (1986) for case-cohort designs, and by Borgan, Langholz, Samuelsen, Goldstein and Pogoda
(2000) for exposure stratified case-cohort designs. For discussions of efficient
estimators for these designs we refer to Nan (2001). Information bound
calculations, however, can tell us how much information we could potentially
gain from fully efficient estimators and which designs would use the observed
data more efficiently if efficient estimators were available. Here we give two
examples in which the information bound calculations can be carried out
analytically. Although these two hypothetical examples are rather special cases
involving evaluation of the information for a simple parametric subfamily,
the calculations in this section illustrate fundamental properties of the
designs and of potential estimators. The results may give some guidance for
designing and analyzing real studies.

4.1. A case-cohort study. We assume that the true distribution has
exponentially distributed failure times and a single binary covariate $Z$ taking
values 0 and 1. Let $h(z) = P(Z = z)$ be the probability that a subject has
covariate value $z \in \{0, 1\}$; thus $h(0) + h(1) = 1$. The censoring time is
distributed with point mass 1 at $t = 1$, which means that all subjects in the
cohort are followed from time zero to either failure or to the end of the study
at $t = 1$. This is discussed as an example by Self and Prentice (1988). The
density of the complete data can be written as

\[
q_{Y,\Delta,Z}(y,\delta,z) =
\begin{cases}
w_1(y \mid z)\,h(z) = \lambda e^{\theta z} \exp(-\lambda e^{\theta z} y)\,h(z), & \delta = 1,\ 0 < y < 1, \\
w_2(y \mid z)\,h(z) = \exp(-\lambda e^{\theta z})\,h(z), & \delta = 0,\ y = 1.
\end{cases}
\tag{4.1}
\]

In the i.i.d. version of the case-cohort study, a simple random subsample is
taken from the nonfailures with sampling (inclusion) probability $\pi_0$. The
values of the covariate $Z$ are measured for all the failures and only the sampled
nonfailures. Hence it is a special case of the two-phase design discussed in
the previous section with $\pi(Y,\Delta) = P(R = 1 \mid Y, \Delta) = \pi(\Delta)$ only, and where
$\pi(1) = 1$, $\pi(0) = \pi_0 \in (0, 1)$. Note that we do not have a surrogate covariate
$V$ in this example. In a classical case-cohort design, $Y$ may not be observed
if the subject is not a failure and not in the subcohort. But for this special
example $\Delta = 0$ implies $Y = 1$. So for information bound calculations it does
not matter whether we treat $Y$ as known or not. Detailed calculation is
omitted here and can be found in Nan (2001).
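To make the sampling scheme concrete, the following sketch simulates data from the i.i.d. case-cohort model described above. It is an illustration rather than part of the information bound calculation. The function name `simulate_case_cohort` and its arguments are hypothetical; the rate $\lambda$ is obtained from a chosen baseline failure probability through $P(T < 1 \mid Z = 0) = 1 - e^{-\lambda}$, which follows from the exponential model.

```python
import math
import random


def simulate_case_cohort(n, theta, p_base, p_z1, pi0, seed=0):
    """Simulate n subjects from the i.i.d. case-cohort model.

    theta  : log hazard ratio for the binary covariate Z
    p_base : baseline failure probability P(T < 1 | Z = 0)
    p_z1   : h(1) = P(Z = 1)
    pi0    : sampling probability for nonfailures
    Returns a list of tuples (y, delta, z_observed, r); z_observed is None
    when the covariate is not measured (r == 0).
    """
    rng = random.Random(seed)
    lam = -math.log(1.0 - p_base)  # from P(T < 1 | Z = 0) = 1 - exp(-lam)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < p_z1 else 0
        t = rng.expovariate(lam * math.exp(theta * z))  # exponential failure time
        delta = 1 if t < 1.0 else 0                     # failure before end of study
        y = min(t, 1.0)                                 # censoring at t = 1
        # All failures are fully observed; nonfailures are sampled with probability pi0.
        r = 1 if delta == 1 or rng.random() < pi0 else 0
        rows.append((y, delta, z if r == 1 else None, r))
    return rows


rows = simulate_case_cohort(20000, math.log(2.0), 0.1, 0.5, 0.1, seed=1)
failure_fraction = sum(d for _, d, _, _ in rows) / len(rows)
print(round(failure_fraction, 3))
```

Every nonfailure in the output has $Y = 1$, which is why it makes no difference in this example whether $Y$ is treated as observed at phase 1.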

Figure 1 displays the ratios of asymptotic variance of the Self and Prentice
(1988) pseudo-likelihood estimator (SP Variance) to the information lower
bound for $\theta$ as a function of the sampling fraction for nonfailures in the i.i.d.
case-cohort model shown above. Figure 1 shows that when the disease is rare,
that is, the baseline failure probability is very low, the pseudo-likelihood
estimator is close to fully efficient. As the failure probability increases, the
pseudo-likelihood estimator loses more efficiency, especially when the
subcohort fraction is small. Hence development of more efficient estimators may
be worthwhile for case-cohort designs where increasing the subcohort
fraction is costly and the failure probability is moderate.

Figure 2 displays the ratio of the information lower bound for estimation
of $\theta$ based on the "observed data" ($1/I_\theta$, where $I_\theta$ is the information for $\theta$)
and the asymptotic variance of the partial-likelihood estimator for $\theta$ based on
"complete" (or "full") data. This ratio is shown as a function of the sampling
fraction for the nonfailures under different baseline failure probabilities when
$e^\theta = 2$. Figure 2 shows that the case-cohort design loses more information
(supposing that an efficient estimator is available), relative to complete data,
as the failure probability increases and as the subcohort fraction decreases.

Fig. 1. Ratio of the variance of the Self and Prentice pseudo-likelihood estimator (SP) to
the Optimal Variance as a function of the sampling fraction for nonfailures under different
baseline failure probabilities, $P(T < 1 \mid Z = 0)$. Here $\theta = \ln(2)$.
Actually, when the baseline failure probability is above 0.5, the curves in
Figure 2 move toward the upper left again as the failure probability increases, 
but there is less interest in these high failure probability cases in practice. 
From Figure 2 we can see that a great deal of precision may be lost by using 
a case-cohort design as opposed to data collection on the full cohort, even 
when a fully efficient estimator is used for the case-cohort study. With this 
knowledge investigators can weigh the trade-off between precision and study 
cost. Further work is needed to explore presumably more efficient designs: for 
example, an alternative design might be an "exposure stratified case-cohort 
design" as in our second example. 

Perhaps the more interesting phenomena appear in Figure 3 and Figure 4.
In Figure 3 we look at the asymptotic relative efficiency of the
pseudo-likelihood estimator as a function of $\theta$. Figure 4 shows the relative efficiency
of the optimal variances for the i.i.d. case-cohort design versus the full data
design as a function of $\theta$. When $\theta$ is near zero Figure 3 shows that the
pseudo-likelihood estimator does not lose much efficiency compared to the optimal
estimator for the case-cohort design. However, Figure 4 shows that the
case-cohort design (with an optimal estimator) loses considerable information
compared to the full data design. The minimum ARE (as a function of $\theta$)
depends on the baseline failure probability; the minimum increases and it
moves away from $\theta = 0$ as the baseline failure probability decreases. When $\theta$
is away from zero, that is, the effect of the covariate $Z$ is large, the
pseudo-likelihood estimator loses significant efficiency, especially when $\theta$ is positive
and the baseline failure probability is high. However, away from zero the
design itself starts to gain information and is very close to the full data
design when the absolute value of $\theta$ is large. The conclusion is that if we
expect intermediate to large covariate effects, it may be very worthwhile to
find efficient estimators for $\theta$. Certainly, developing more efficient designs is
also valuable, as can be seen from Figures 2 and 4.

Here the sampling fraction for nonfailures is 0.1.

4.2. An exposure stratified case-cohort study. Assume that $X$ is the
variable of interest and that $V$ is a surrogate variable for $X$, or measurement
of $X$ with error, and $V$ is conditionally independent of $T$ given $X$. We
suppose that $V$ can be observed for everyone in the entire cohort, but $X$ is
only observed for subjects in the subcohort and failures. Then the model
for this type of data is the first alternative model discussed in the
previous section. The i.i.d. version of the exposure stratified case-cohort design
studied by Borgan, Langholz, Samuelsen, Goldstein and Pogoda (2000) is a
special case of this model. Here we discuss an example with a binary
covariate $X \in \{0, 1\}$ and a binary surrogate variable $V \in \{0, 1\}$. The distribution
for $X$ and $V$ has the form of a $2 \times 2$ table. Let

\[ P(V = 1 \mid X = 1) = 1 - \alpha, \qquad P(V = 0 \mid X = 0) = 1 - \beta. \]

If we consider $V = 1$ as a "positive" test for $X$, then $1 - \alpha$ is the sensitivity
and $1 - \beta$ is the specificity of the test. We assume exponentially distributed
failure times. All subjects in the cohort are followed from time zero to either
failure or to the end of the study at time $t = 1$. For our calculations the
exponential failure rate parameter $\lambda$ will be set to achieve a specified baseline
failure probability as in Section 4.1. Let the joint mass function of $(X,V)$
be $h(x,v)$. Thus we have the joint density for the underlying complete data,

\[
q_{Y,\Delta,X,V}(y,\delta,x,v) =
\begin{cases}
\lambda e^{\theta x} \exp(-\lambda e^{\theta x} y)\,h(x,v), & \delta = 1,\ 0 < y < 1, \\
\exp(-\lambda e^{\theta x})\,h(x,v), & \delta = 0,\ y = 1.
\end{cases}
\tag{4.2}
\]

By the same argument as in the previous example, we may assume that
$Y$ always is observed. The cohort is then categorized into three strata:
$\{\Delta = 1\}$, $\{\Delta = 0, V = 0\}$ and $\{\Delta = 0, V = 1\}$. We observe complete
information for all the subjects in the first stratum, and for fractions $\pi_0$, $\pi_1$
(constants) of the subjects in the second and third strata, respectively. We
only observe $(Y,\Delta,V)$ for other subjects. In probability language we have
$P(R = 1 \mid \Delta = 1, Y, V) = \pi(Y,1,V) = 1$ and $P(R = 1 \mid \Delta = 0, Y, V) = \pi(Y,0,V) = \pi_0$
if $V = 0$ and $\pi_1$ if $V = 1$. Again, we omit the detailed calculations here
and refer to Nan (2001).
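The $2 \times 2$ table for $(X,V)$ is determined by $P(X = 1)$, the sensitivity $1 - \alpha$ and the specificity $1 - \beta$. The sketch below builds $h(x,v)$ from these three quantities and computes the covariance of $X$ and $V$, which for two binary variables is zero exactly when they are independent. The function names `joint_mass` and `covariance_xv` are hypothetical helpers, and the numerical inputs are illustrative.

```python
def joint_mass(p_x1, alpha, beta):
    """Joint mass function h(x, v) of (X, V), keyed by (x, v).

    p_x1  : P(X = 1)
    alpha : 1 - sensitivity, so P(V = 1 | X = 1) = 1 - alpha
    beta  : 1 - specificity, so P(V = 0 | X = 0) = 1 - beta
    """
    return {
        (1, 1): p_x1 * (1.0 - alpha),
        (1, 0): p_x1 * alpha,
        (0, 0): (1.0 - p_x1) * (1.0 - beta),
        (0, 1): (1.0 - p_x1) * beta,
    }


def covariance_xv(h):
    """Cov(X, V) = P(X = 1, V = 1) - P(X = 1) P(V = 1) for binary X and V."""
    p_x1 = h[(1, 0)] + h[(1, 1)]
    p_v1 = h[(0, 1)] + h[(1, 1)]
    return h[(1, 1)] - p_x1 * p_v1


uninformative = joint_mass(0.1, 0.5, 0.5)   # alpha = beta = 0.5
informative = joint_mass(0.1, 0.1, 0.1)     # sensitivity = specificity = 0.9
print(covariance_xv(uninformative), covariance_xv(informative))
```

With $\alpha = \beta = 0.5$ the covariance vanishes, so the surrogate carries no information about $X$ and stratifying on it cannot help.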

We calculate $I_\theta$ for different $\alpha$, $\beta$, $P(X = 0)$, $\pi_0$, $\pi_1$, $\theta$ and $\lambda$ by using
numerical integration. When $\alpha = \beta = 0.5$, the exposure stratified case-cohort
design is equivalent to the classical case-cohort design (previous example)
since $V$ is not correlated with $X$ under this condition. Figures 5 and 6 show
the comparisons of the asymptotic relative efficiency (ARE) of fully efficient
estimators (if they exist) for the exposure stratified (at $\pi_0 = \pi_1$ and $\pi_0 \neq \pi_1$)
and classical case-cohort designs at $e^\theta = 2$ and $P(X = 0) = 0.9$. When $\theta = 0$,
the corresponding figures (not shown) have similar patterns but slightly
different magnitude. In Figure 5 the sampling probabilities in the two strata
are equal, that is, $\pi_0 = \pi_1 = 0.1$. In Figure 6 $\pi_0$ and $\pi_1$ are different, such
that the expected numbers of sampled subjects in strata $\{\Delta = 0, V = 0\}$
and $\{\Delta = 0, V = 1\}$ are the same (or approximately the same), and the
fraction of sampled subjects from the two strata all together is 0.1 (or
approximately 0.1). We can see from Figure 5 that the efficiency increases as
the sensitivity ($1 - \alpha$) and specificity ($1 - \beta$) increase as a result of
incorporating the surrogate variable into the model, even without stratification.
Note that an efficient estimator will incorporate $V$ for subjects outside the
subsample, providing information for the estimation of $\theta$ via the correlation
between $X$ and $V$. When we do stratified sampling ($\pi_0 \neq \pi_1$), Figure 6 shows
that the efficiency gains are even greater. So both incorporating surrogate
information and stratified sampling will increase the efficiency. Note that
the information bound calculation illustrated here is for a measurement
error problem without making any assumptions on the structure of the joint
distribution of $(X,V)$.

Fig. 5. Asymptotic efficiency of the optimally efficient estimator for the case-cohort
design with a surrogate variable, relative to that for the full cohort design, as a function of the
baseline failure probability, $P(T < 1 \mid X = 0)$. Here $\theta = \ln(2)$, $P(X = 0) = 0.9$, $\pi_0 = \pi_1 = 0.1$
(i.e., stage 2 sampling is not stratified).

Fig. 6. Asymptotic efficiency of the optimally efficient estimator for the i.i.d.
stratified case-cohort design with a surrogate variable, relative to that for the full cohort
design, as a function of the baseline failure probability $P(T < 1 \mid X = 0)$. Here $\theta = \ln(2)$,
$P(X = 0) = 0.9$, $\pi_0 \neq \pi_1$ (i.e., with stratified sampling at stage 2).

Table 1
Comparisons of asymptotic relative efficiency (ARE relative to a full cohort study): the
approximate AREs of pseudo-likelihood estimators and the AREs of information bounds
for a stratified case-cohort design, which has only the binary covariate of interest, $X$, and
a binary covariate $V$, which is a surrogate for $X$. The specificity of $V = 1$ as a test for
$X = 1$ is $1 - \beta$, and the sensitivity is $1 - \alpha$. The subcohort size equals the expected
number of cases (PL = Pseudo-Likelihood. This part is taken from Table 1 of Borgan,
Langholz, Samuelsen, Goldstein and Pogoda (2000). IB = Information Bound)

(a) P(X = 1) = 0.05

            ARE(PL), %            ARE(IB), %            ARE(PL)/ARE(IB), %
            1-β                   1-β                   1-β
1-α     0.50  0.70  0.90      0.50  0.70  0.90      0.50  0.70  0.90
0.50    35.5  36.5  40.8      36.0  37.8  45.9      98.6  96.6  88.9
0.70    36.5  39.6  47.3      37.7  43.0  55.9      96.8  92.1  84.6
0.90    40.8  47.3  60.5      43.9  53.0  70.3      92.9  89.2  86.1

(b) P(X = 1) = 0.50

            ARE(PL), %            ARE(IB), %            ARE(PL)/ARE(IB), %
            1-β                   1-β                   1-β
1-α     0.50  0.70  0.90      0.50  0.70  0.90      0.50  0.70  0.90
0.50    52.9  54.0  58.4      53.5  55.5  63.2      98.9  97.3  92.4
0.70    54.0  57.3  64.7      55.5  61.1  72.0      97.3  93.8  89.9
0.90    58.4  64.7  75.8      62.6  71.2  83.6      93.3  90.9  90.7
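The choice of unequal sampling probabilities described for Figure 6 can be sketched as follows. The code computes $P(\Delta = 0, V = v) = \sum_x h(x,v)\exp(-\lambda e^{\theta x})$ from the model (4.2) and then picks $\pi_0$ and $\pi_1$ so that the two nonfailure strata contribute equal expected numbers of sampled subjects. Two points are assumptions made here and not stated in the text: the overall fraction 0.1 is interpreted as a fraction of the nonfailures, and the function name `stratum_fractions` and its arguments are hypothetical.

```python
import math


def stratum_fractions(p_x1, alpha, beta, theta, lam, overall=0.1):
    """Sampling probabilities (pi_0, pi_1) giving equal expected sample sizes.

    Assumes `overall` is the sampled fraction among all nonfailures, split
    evenly between the strata {Delta = 0, V = 0} and {Delta = 0, V = 1}.
    Returns (fractions, p_nonfail), both dictionaries keyed by v.
    """
    h = {
        (1, 1): p_x1 * (1.0 - alpha),
        (1, 0): p_x1 * alpha,
        (0, 0): (1.0 - p_x1) * (1.0 - beta),
        (0, 1): (1.0 - p_x1) * beta,
    }
    # P(Delta = 0, V = v): survive to t = 1 with hazard lam * exp(theta * x).
    p_nonfail = {
        v: sum(h[(x, v)] * math.exp(-lam * math.exp(theta * x)) for x in (0, 1))
        for v in (0, 1)
    }
    total = p_nonfail[0] + p_nonfail[1]
    fractions = {v: 0.5 * overall * total / p_nonfail[v] for v in (0, 1)}
    if max(fractions.values()) > 1.0:
        raise ValueError("equal allocation is infeasible for these inputs")
    return fractions, p_nonfail


lam = -math.log(1.0 - 0.1)  # baseline failure probability 0.1
fractions, p_nonfail = stratum_fractions(0.1, 0.1, 0.1, math.log(2.0), lam)
print(fractions)
```

When one stratum is too small for an exact equal split, a probability above 1 would be required; the sketch raises an error in that case, whereas in practice one would cap the probability at 1, which may be one reason the text says "approximately the same".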

Borgan, Langholz, Samuelsen, Goldstein and Pogoda (2000) study inverse
probability weighted estimators in this model. In order to show how much
efficiency the inverse probability weighted estimator loses, we calculate the ARE of
their estimator relative to a fully efficient estimator with asymptotic
variance given by our information bound $1/I_\theta$ for the setting of their
Example 1. We choose their optimal sampling fractions and a very small failure
rate, $\lambda = 0.01$, which we believe is small enough to be able to make valid
comparisons to their results. Note that the results in Table 1 are calculated
under the condition that the subcohort size equals the expected number
of cases, providing approximately one "control" per case, a frequently used
design.
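The third block of Table 1 is the ratio of the first two, so the transcription of the table can be cross-checked arithmetically. In the sketch below the nine values of each block are listed in the order in which they appear in the source; the variable names and the helper `max_discrepancy` are introduced here for illustration. Because the published entries are rounded to one decimal place, recomputed ratios are compared with a small tolerance instead of exact equality.

```python
# Table 1 entries in percent, listed block by block in source order.
panel_a = {
    "PL": [35.5, 36.5, 40.8, 36.5, 39.6, 47.3, 40.8, 47.3, 60.5],
    "IB": [36.0, 37.7, 43.9, 37.8, 43.0, 53.0, 45.9, 55.9, 70.3],
    "ratio": [98.6, 96.8, 92.9, 96.6, 92.1, 89.2, 88.9, 84.6, 86.1],
}
panel_b = {
    "PL": [52.9, 54.0, 58.4, 54.0, 57.3, 64.7, 58.4, 64.7, 75.8],
    "IB": [53.5, 55.5, 62.6, 55.5, 61.1, 71.2, 63.2, 72.0, 83.6],
    "ratio": [98.9, 97.3, 93.3, 97.3, 93.8, 90.9, 92.4, 89.9, 90.7],
}


def max_discrepancy(panel):
    """Largest absolute gap between 100 * PL / IB and the tabulated ratio."""
    return max(
        abs(100.0 * pl / ib - ratio)
        for pl, ib, ratio in zip(panel["PL"], panel["IB"], panel["ratio"])
    )


print(round(max_discrepancy(panel_a), 3), round(max_discrepancy(panel_b), 3))
```

All eighteen recomputed ratios agree with the tabulated values to within rounding, and in every cell the pseudo-likelihood ARE is below the information bound ARE, as it must be.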

5. Conclusions and further problems. We have established new
information bounds for the Cox model with missing data. Along the way we have
developed a new decomposition of $L_2^0(Q)$, characterized the structure of the
orthogonal complement of the nuisance parameter tangent space $\dot{Q}_\eta$, and
shown how to project onto the space $\dot{Q}_\eta^{\perp}$ using conditional versions $R_1$ and
$R_2$ of the mean residual life operator $R$ introduced by Efron and Johnstone
(1990) and Ritov and Wellner (1988). The new bounds can be used to
examine the loss of efficiency of pseudo-likelihood estimators for a given design
and the amount of information loss due to a given design relative to
complete data collection or to an alternative two-phase design. While it has been
known for some time that pseudo-likelihood estimators are not
semiparametrically efficient, our explicit calculations quantify the loss of efficiency and
also show that two-phase designs with stratified subsampling can partially
recover the information that is lost due to missing data.

Further problems. 

1. Construction of efficient estimators when covariates are discrete. Efficient 
estimators can be constructed explicitly using one-step methods when the 
covariates are discrete. For a preliminary study of such estimators, see 
Nan (2001). 

2. Construction of efficient estimators in general. This will depend crucially
on understanding the properties of the integral equation defining the
efficient score and influence function. A major difficulty in constructing
efficient estimators is the fact that the conditional cumulative hazard
function $\Lambda_C(y \mid z)$ enters into the key equation (3.18) which determines
$u^*$ and hence the efficient score function $\ell_\theta^*$. This function is typically
completely unknown and is a function of $d + 1$ variables which must be
estimated nonparametrically. This is, of course, a difficult task for even
moderately large $d$. However, our goal is not to estimate $\Lambda_C$ well, but
instead to estimate $\theta$ well, and it is not yet clear how crucial the difficulty
in estimating $\Lambda_C$ will be for construction of (nearly) efficient estimates
of $\theta$. We remain optimistic about this at least for moderate values of $d$,
and regard this as an important question for future work.

3. How can we "optimize" the sampling design for a particular study? If
we focus on the variance of the estimator of a particular regression
coefficient (e.g., the coefficient corresponding to a binary treatment-control
covariate), then it would be very interesting to know how to allocate the
sampling effort in the second phase to minimize the (asymptotic)
variance. Our results provide the tools to graphically address this extremely
important question.

4. Are there better compromise estimators based on pseudo-likelihood? Here 
the approaches of Chatterjee, Chen and Breslow (2003) and Chatterjee 
(1999) may be useful. 